As many of you may have noticed, yesterday was a bad day for FAF.
Context
In the background we have been working on a migration of services from Docker to Kubernetes (for almost two years by now, actually...). We finally reached a state where we wanted to migrate the first services, in particular the user service (user.faforever.com) and Ory Hydra (hydra.faforever.com). In order to do this we needed to make the database available in Kubernetes.
This, however, is tricky: in Docker every service has a hostname, and faf-db is the name of our database. It also has an IP address, but that IP address is not stable. The best way to make a Docker service available to Kubernetes on the same host is to expose the database on the host network. But right now the database is only reachable on the host from 127.0.0.1, not from inside the Kubernetes network. Changing that required a change to the faf-db container and would have caused downtime. As an alternative we used a TCP proxy bound to a different port, and with that a test version of our login services was already working, with its database connection pointed at the proxy port. We then planned to expose the actual MariaDB port with the next server restart...
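For illustration, here is a minimal sketch of such a TCP proxy in Python (our real setup uses a dedicated proxy; the listen port here is a placeholder): it accepts connections on an extra host port and forwards the bytes to the MariaDB port that is only bound to 127.0.0.1.

```python
import asyncio

LISTEN_PORT = 13306                    # extra host port reachable from Kubernetes (placeholder)
DB_HOST, DB_PORT = "127.0.0.1", 3306   # MariaDB, only bound to localhost

async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # Copy bytes in one direction until the connection closes.
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle_client(client_reader, client_writer):
    # For every incoming connection, open one to the local database
    # and shovel traffic in both directions.
    db_reader, db_writer = await asyncio.open_connection(DB_HOST, DB_PORT)
    await asyncio.gather(pipe(client_reader, db_writer), pipe(db_reader, client_writer))

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", LISTEN_PORT)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```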
Another thing to know:
We manage all our Kubernetes secrets in a cloud service called Infisical. You can manage secrets for multiple environments there, and changes are synced directly to the cluster. This simplifies handling a lot.
Yesterday morning
It all started with a seemingly routine task: a server restart.
We had planned it because the server had been running for several months without a restart, i.e. with an unpatched Linux kernel.
So before work I applied the change and restarted the server.
Along with the restart we applied the change described above: we made the MariaDB port available to everybody on the network, not just 127.0.0.1. It is still protected by a firewall, but this change allowed it to be used from our internal Kubernetes cluster.
That actually worked well... or so I thought...
More Kubernetes testing
Now, with the Docker change in place, I wanted to test whether our login services also work on Kubernetes. Unfortunately I made three changes which had much more impact than planned.
First, I updated the connection string of the login service to use the new port. Second, I absent-mindedly set the endpoint of the user service to match the official one, so e.g. user.faforever.com now pointed to K8s. Third, I set the environment to K8s, because this name shows up in the top left of the login screen everywhere except production.
Now we had two pairs of components running:
- A Docker user service talking to a Docker Ory Hydra
- A K8s user service talking to a K8s Ory Hydra
What I wasn't aware of (this is all new to us):
- If an app from Docker and an app from K8s compete for the same DNS record, the K8s app wins. So all users were pointed to the K8s user service talking to the K8s Ory Hydra.
- By changing the environment, I also changed the place where our Kubernetes app "Infisical" tries to download its secrets. It now pointed to an environment "K8s" which didn't exist and had no secrets. Thus the updated connection string could not be synced to K8s, leaving Ory Hydra with a broken connection string, incapable of passing through logins.
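To make that second point concrete, here is a hypothetical illustration (this is not Infisical's real API; the environment names and values are placeholders): secrets are looked up per environment, so a non-existent environment simply has nothing to sync.

```python
# Hypothetical sketch, not Infisical's real API: secrets are keyed by environment.
SECRETS_BY_ENVIRONMENT = {
    "production": {"DATABASE_DSN": "<real connection string>"},
    "test": {"DATABASE_DSN": "<test connection string>"},
}

def sync_secrets(environment: str) -> dict:
    secrets = SECRETS_BY_ENVIRONMENT.get(environment)
    if secrets is None:
        # An unknown environment such as "K8s" has nothing to sync, so the app
        # keeps running with a stale or empty connection string.
        raise LookupError(f"no secrets defined for environment {environment!r}")
    return secrets
```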
So there were two different errors stacked on top of each other. Both difficult to find.
One fuckup rarely comes alone
Unfortunately, in the meantime yet ANOTHER error occurred. We assume that the operating system for some reason ran out of file descriptors or something similar, causing weird errors; we are still unsure.
The Docker-side Ory Hydra was still running as usual. For whatever reason it could no longer reach the existing database, even after a restart. We have never seen that error before, and we still don't know what caused it.
Also, IRC was suddenly affected: once a critical mass was reached, users were kicked out of the system, leading to permanent reconnects from all connected clients, which in turn created even more file descriptors...
So now we had three errors stacked on top of each other, and even rolling back didn't solve the problem.
This all happened during my working hours, which made it very difficult to thoroughly understand what was going on or to fix it quickly.
When we finally found the errors we could at least fix the login. But the IRC problem persisted, so we shut IRC down until the next morning, when fewer people were trying to connect.
Conclusions
- The FAF client needs to stop instantly reconnecting to IRC after a disconnect. There should be a waiting time with exponential backoff to avoid overloading IRC (see the sketch after this list). (It worked in the past, we didn't change it, and we don't know why this is an issue now...)
- The parallel usage of Docker and Kubernetes is problematic and we need to intensify our efforts to move everything.
- More fuckups will happen because of point 2, but we have to keep pushing.
- Most important: The idea of making a change when fewer users are online is nice, but it conflicts with my personal time. The server was in a broken state for more than half a day because I didn't have time to investigate (work, kids). The alternative is to make these changes when I have free time: at the peak time of FAF, around 21:00-23:00 CET. This affects more users, but shortens troubleshooting time. What do you think? Write in the comments.
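To illustrate the first point, here is a minimal, language-agnostic sketch in Python of reconnecting with exponential backoff and jitter (the `connect` callable and the concrete timings are placeholders, not a decision about the actual client):

```python
import random
import time

def reconnect_with_backoff(connect, base_delay=2.0, max_delay=300.0):
    """Retry `connect` with exponential backoff plus jitter instead of hammering IRC."""
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            # Wait 2s, 4s, 8s, ... capped at max_delay, plus random jitter so that
            # thousands of clients don't all retry at the same instant.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))
            attempt += 1
```

The jitter matters as much as the backoff: without it, all clients would retry at the same moments and keep producing the same connection (and file descriptor) spikes.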