TL;DR: if you need a cloud service you may want to avoid Hetzner.
The longer story:
On Sunday the 1st of December, at 00:00 (UTC), our main storage backend became entirely unreachable. For the average user that meant not being able to access the library and download files, and for us that meant not being able to connect to it and see what was wrong.
Turns out that Hetzner had decided to cancel our account and terminate all servers. There was no warning (yes, we checked our spam folder), and nobody could be reached before Monday morning. The cancellation having happened on the first day of the month at 00:00, we knew it was a scheduled event. But why?
Houston, we have a problem
Here is the slightly panicked email that we sent on Sunday the 1st, around noon:
> Procedure: L0020649F
> Person: [redacted] / Kiwix
> Cause: Hello,
>
> Starting this morning (December 1st at 00:00 UTC), our servers went down.
> We received zero email nor notification of any kind from you.
> Looking for a way to contact you, I looked into this Unlock tab that list an incident
> that matches the time the problem started.
>
> It’s been close to (12) hours already, without a single message from you. Our services
> are down.
>
> In the Robot dashboard, there is no server listed. In the Traffic statistics page, it
> says we have no IP.
> In the Cloud dashboard, we cant even enter, it says Access Denied.
>
> What’s going on? The billing page is reachable and it indicates we paid all our
> invoices and the next one is to come in 5 days. So it’s not a payment issue.
>
> I checked
> https://docs.hetzner.com/robot/dedicated-server/troubleshooting/guideline-in-case-of-server-locking/
>
> I am not sure if we’re locked because the traceroute does not lead to
> blocked.hetzner.com
> Because the server is not listed, we cant use the whitelist or any other tool.
>
> Please restore the service immediately.
> Please let us know what kind of issue there is if there is one.
>
> Only restoring SX65 #2453510 (135.181.224.247) is urgent. The two cloud ones can be
> sorted out later.
No response. But at this stage of our trouble, we at least knew it was not a payment issue.
This is not the response you are looking for
Our dashboard access got canceled altogether the next day, so we had to call Germany. But, when reached, they could not explain the reason for the cancellation:
“– We sent you an email.
– We did not receive it, can you please resend?
– We can not.”
ಠ_ಠ
In the meantime, all servers had already been wiped, so there was no way to retrieve our data. And yes, we checked our spam folders.
If you are looking for a bad case of the Mondays, that was one.
Luckily we have mirrors and these were not affected. We grabbed a new machine somewhere else (Scaleway; if we name-and-shame the one we might as well name-and-greet the other) and immediately started re-importing our data to our new Master server. All in all, it still took about 48 hours to get these 8-ish TB back online.
What now?
If there is any silver lining to this, it is that we could see a few points of vulnerability as well as our ability to turn things around reasonably quickly (here be kudos for the two heroes who manage our infra). We will see in the coming weeks and months how we can implement new safeguards within our resource constraints.
We have now moved on (and away), but if we did something wrong that caused Hetzner to cancel our access, it would be nice to know what it was so that we can make sure it does not happen again. The lack of communication is mind-boggling.
Hetzner, on the other hand, could maybe consider that, since they are hosting business-critical resources for their clients, sending a single email and considering their part done needs some reworking. Maybe.