The fuck happened yesterday - short recap


As many of you may have noticed, yesterday was a bad day for FAF.

Context

In the background we have been working on a migration of services from Docker to Kubernetes (for almost two years now, actually...). We recently reached the point where we wanted to migrate the first services, in particular the user service (user.faforever.com) and Ory Hydra (hydra.faforever.com). In order to do this, we needed to make the database available in Kubernetes.

This, however, is tricky: in Docker every service has a hostname, and faf-db is the name of our database. It also has an IP address, but that address is not stable. The best way to make a Docker service available to Kubernetes on the same host is to expose the database on the host network. But until now the database was only reachable on the host via 127.0.0.1, not from inside the Kubernetes network. Changing that required a change to the faf-db container and would have caused downtime. As an interim alternative we used a TCP proxy bound to a different port. With that in place, a test version of our login services was working, with its database connection pointed at the proxy port. The plan was to expose the actual MariaDB port with the next server restart...
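To make the difference concrete, here is a minimal Compose sketch (not our real faf-stack file; the image and port are just MariaDB defaults used for illustration):

```yaml
# Hypothetical excerpt, not the actual faf-stack compose file.
services:
  faf-db:
    image: mariadb        # image/tag assumed
    ports:
      # Old state: bound to loopback only, so nothing outside the host
      # (including the local Kubernetes cluster) can reach the database.
      - "127.0.0.1:3306:3306"
      # Planned state after the restart: bind on all host interfaces
      # (still protected by the firewall), so K8s pods can connect directly.
      # - "3306:3306"
```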

Another thing to know:
We manage all our Kubernetes secrets in a cloud service called Infisical. You can manage secrets for multiple environments there, and changes are directly synced to the cluster. This simplifies handling a lot.
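Assuming the standard Infisical Kubernetes operator is what does the syncing, the resource driving it looks roughly like the sketch below. This is written from memory of the operator docs, field names and nesting may differ from the version we run, and all concrete names are placeholders; the important part is the environment slug.

```yaml
# Illustrative only - field layout from memory of the Infisical operator docs,
# all names are placeholders.
apiVersion: secrets.infisical.com/v1alpha1
kind: InfisicalSecret
metadata:
  name: faf-user-service-secrets
spec:
  authentication:
    universalAuth:
      credentialsRef:
        secretName: infisical-machine-identity   # placeholder
        secretNamespace: faf                     # placeholder
      secretsScope:
        projectSlug: faf                         # placeholder
        envSlug: test        # which Infisical environment the secrets come from
        secretsPath: "/"
  managedSecretReference:
    secretName: faf-user-service-env   # the plain Kubernetes Secret the operator keeps in sync
    secretNamespace: faf
```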

Yesterday morning

It all started with a seemingly well-known routine: a server restart.
We had planned it because the server had been running for several months without a restart, i.e. on an unpatched Linux kernel.

So before work I applied the change and restarted the server.

Along with the restart we applied the change described above: we made the MariaDB port available to the whole host network instead of just 127.0.0.1. It is still protected by the firewall, but this change allows our internal Kubernetes cluster to use it.

That actually worked well... or so I thought...

More Kubernetes testing

Now, with the Docker change in place, I wanted to test whether our login services also work on Kubernetes. Unfortunately I made a few changes that had much more impact than planned.
First, I updated the connection string of the login service to use the new port. Secondly, I absent-mindedly set the endpoint of the user service to match the official one, so e.g. user.faforever.com now pointed to K8s. Thirdly, I set the environment name to "K8s", because it shows up in the top left of the login screen everywhere except production.
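In config terms all three edits were tiny. The snippet below is purely hypothetical - the key names are invented for illustration and are not our real Helm values - but it shows how harmless the changes looked:

```yaml
# Hypothetical values sketch; key names invented for illustration.
user-service:
  environment: K8s                    # change 3: label shown in the login screen (except on prod)
  ingress:
    host: user.faforever.com          # change 2: now claims the official hostname
  database:
    # change 1: connection string switched to the newly exposed MariaDB port
    url: jdbc:mariadb://<host>:3306/faf-user   # host and database name are placeholders
```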

Now we had two pairs of components running:

A Docker user service talking to a Docker Ory Hydra
A K8s user service talking to a K8s Ory Hydra

What I wasn't aware of (this is all new to us):

  1. If an app from Docker and an app from K8s compete for the same DNS record, the K8s app wins. So all users were pointed to the K8s user service talking to the K8s Ory Hydra.
  2. By changing the environment, I also changed the place where our Kubernetes Infisical integration tries to download its secrets. It now pointed to an environment "K8s" which didn't exist and had no secrets. Thus the updated connection string could not be synced to K8s, leaving Ory Hydra with a broken connection string, incapable of passing through logins.

So there were two different errors stacked on top of each other, both difficult to find.

One fuckup rarely comes alone

Unfortunately, in the meantime yet ANOTHER error occurred. We assume that the operating system for some reason ran out of file descriptors or something similar, causing weird errors; we are still unsure. The effect was this:

The Docker-side Ory Hydra was still running as usual, but for whatever reason it could no longer reach the existing database, even after a restart. We have never seen that error before, and we still don't know what caused it.
IRC was also suddenly affected: once a critical mass was reached it kicked users out of the system, leading to permanent reconnects from all connected clients, which in turn created even more file descriptors...

So now we had three errors stacked on top of each other, and even rolling back didn't solve the problem.

This all happened during my working hours, which made it very difficult to thoroughly understand what was going on or to fix it quickly.

Once we finally found the errors, we could at least fix the login. But the IRC problem persisted, so we shut IRC down until the next morning, when fewer people were trying to connect.

Conclusions

  1. The FAF client needs to stop instantly reconnecting to IRC after a disconnect. There should be a waiting time with exponential backoff to avoid overloading IRC. (It worked in the past and we didn't change it; we don't know why this is an issue now...)

  2. The parallel usage of Docker and Kubernetes is problematic and we need to intensify our efforts to move everything.

  3. More fuckups will happen because of 2., but we have to keep pushing.

  4. Most important: The idea of making changes when fewer users are online is nice, but it conflicts with my personal time. The server was in a broken state for more than half a day because I didn't have time to investigate (work, kids). The alternative is to make these changes when I have free time: at FAF's peak time, around 21.00-23.00 CET. This affects more users, but shortens troubleshooting time. What do you think? Write in the comments.

"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of is the familiar."
– Benno Rice

I don’t have much stake in the game when it comes to server updates during FAF peak hours since I’m rarely able to play during peak hours as it is, but I’d rather the people like you who are volunteering their time to keep this ship afloat be able to do what they need to do when it’s more convenient for them if possible. It’s not like you’re paid for this, and personally I’d rather you guys not feel burned out by feeling forced to give up personal time just for FAF’s convenience.

Appreciate the work and the update on what happened here!

Thanks for the postmortem, really appreciate it! Any reason you guys decided to go with K8s at all? I have to deal with it a lot at work and it's a tough bitch to manage, so I'm surprised that such a small infrastructure as FAF needs it. I wonder if there was anything besides it being a docker-compose alternative or whatnot.

@brutus5000 said in The fuck happened yesterday - short recap:

The alternative is to make these changes when I have free time: at FAF's peak time, around 21.00-23.00 CET. This affects more users, but shortens troubleshooting time. What do you think?

Sounds good. It's just a game. If a few people get pushed out sometimes at peak busy time, that's better than having problems linger for 24+ hours at a time.

And if that's what you would prefer, even better. It's important that FAF devs/admins enjoy working on FAF.

Employers compensate for stress by paying people. The best we can offer is just to not make the project more stressful than it needs to be.

There are plenty of reasons why we're moving to K8s. At first glance it sounds counter-intuitive to run it on a single host, but you've got to start somewhere. Some of the benefits are:

  1. No more fiddling on the server itself. SSHing in and manually copying and modifying files everywhere is dangerous.
  2. GitOps. We commit to the repo, and ArgoCD applies it. ArgoCD also reconciles differences between what is and what should be - drift is a regular issue in our current setup.
  3. Empowering more people. SSH access to the server must be restricted because of data protection; if people don't need it, we can potentially engage more of them to work on the infrastructure.
  4. Reducing the difference between test and prod config by using Helm templating where useful. E.g. instead of declaring different URLs on test and prod, you just template the base domain while the rest stays the same (see the sketch below this list).
  5. Zero-downtime updates. On Docker, updating certain services, e.g. the user service, always means downtime. With K8s the old instance keeps running until the new one is up. With Docker this is only achievable with massive additional work.
  6. Standardization. We have a lot of workflows built on custom aliases and scripts for dealing with Docker that nobody else knows. K8s is operated in a more standardized way, even though we deviate from the standard where we see benefits.
  7. Outlook. Right now we are on a single host. In a Docker setup we always will be, because services are declared in the YAMLs, volumes are mounted on one host, etc. With K8s we have the flexibility to move at least some services elsewhere. We're not planning to fix all of our single points of failure, but it gives us more flexibility.
  8. Ecosystem and extensibility. We have squeezed the maximum out of Docker (Compose); nothing more will come of it. In K8s the journey has just begun, and more is added every day: automated volume backups, policy enforcement, virtual clusters for improved security, and more.
  9. Declarative over imperative. Setting up the faf-stack Docker Compose relies heavily on scripts to set up e.g. databases, RabbitMQ users, topics, ... In K8s all major services can be managed declaratively using operators. Whether it makes sense to use one for a single instance needs to be decided on a case-by-case basis.
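Here is the sketch mentioned under point 4: with Helm templating, the environments differ only in a small values file while the templates stay identical. All names below are invented for illustration and are not taken from the actual FAF charts.

```yaml
# Hypothetical Helm excerpt; names invented for illustration.
#
# values-test.yaml:  baseDomain: test.faforever.com
# values-prod.yaml:  baseDomain: faforever.com
#
# templates/ingress.yaml - the same template renders the right host per environment:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: user-service
spec:
  rules:
    - host: "user.{{ .Values.baseDomain }}"
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: user-service
                port:
                  number: 8080   # assumed port
```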

The list probably goes on, but you can see we have enough reasons.
As for other alternatives: we looked into them, but p4block has been using Kubernetes in daily work for ~a year and I have used it for ~4 years and got certified, so we use the tools we know 🙂

"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of is the familiar."
– Benno Rice

I think you are doing a great job. Thank you for the explanation and for putting your time into troubleshooting.
My opinion is that - if, as you described, changes are not made very often - it is a better idea to do them when the server administrators have time, even if that is during FAF peak hours.
An announcement on Discord that some minor downtime is expected should be enough.

imma just nod and pretend like i understood any of this, great work as always, time wise just do what you can when you can don't feel pushed family should come 1st

Vault Admin / Creative Team / Map Guru

Was the database issue where it could no longer be found a MariaDB issue, or a network routing issue where, once it got exposed onto the network, routing fuck-ups occurred? Or do we not know?

Ras Boi's save lives.

Thx for the Post Mortem!

What i would suggest:

  1. Make "critical" changes when you guys have time to monitor/support, even if that means ppl cannot play (once a week to balance it out?)
    -- There is always something that needs attention after an update

  2. It would be nice if we got some info about the current state (on Discord). Some players could start games while others could not, and nobody was saying "we know about the issues and we are on it, we will post an update in x hours".

  3. Give the http-client a lil love, it's creating so many issues if the server is not responding/behaving as expected.

  • noticed that when the replay server was not available, it crashed the client bc. of an unhandled exception.

PS: You guys still looking for support to migrate?

@noc said in The fuck happened yesterday - short recap:

Was the database issue where it could no longer be found a MariaDB issue, or a network routing issue where, once it got exposed onto the network, routing fuck-ups occurred? Or do we not know?

We do not know. But we had errors at the OS level saying "too many open files". That would explain why you can't open new connections while running applications keep theirs alive. It doesn't match whatever happened with IRC though, because there we had lots of new connections, but they instantly died again.
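For what it's worth, per-container file descriptor limits can be raised directly in Docker Compose. Whether that would have helped here is unclear, and the service name and numbers below are only illustrative, not what faf-stack actually uses:

```yaml
# Illustrative only - not the actual faf-stack configuration.
services:
  faf-irc:            # hypothetical service name
    ulimits:
      nofile:         # max open file descriptors for this container
        soft: 65536
        hard: 65536
```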

@clint089 said in The fuck happened yesterday - short recap:

Thx for the Post Mortem!

What i would suggest:

  1. Make "critical" changes when you guys have time to monitor/support, even if that means ppl cannot play (once a week to balance it out?)
    -- There is always something that needs attention after an update

The irony here is that each change on its own would have been a no-brainer. If you had asked me beforehand, I would've said nothing could go wrong here. 😉

  2. It would be nice if we got some info about the current state (on Discord). Some players could start games while others could not, and nobody was saying "we know about the issues and we are on it, we will post an update in x hours".

We mostly try to do that. Unfortunately, in this case all status checks said our services were working. And it takes some time to read the feedback (aka "shit's on fire, yo") and to confirm what is actually going on.

  3. Give the http-client a lil love, it's creating so many issues if the server is not responding/behaving as expected.
  • noticed that when the replay server was not available, it crashed the client bc. of an unhandled exception.

I have no clue what you mean by http client here, but that sounds like a job for the client department 😄

PS: You guys still looking for support to migrate?

The next milestone is getting ZFS volumes with OpenEBS running. Support on that would be nice, but it's a very niche topic; I doubt we have experts here.
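For the curious, a StorageClass for the OpenEBS ZFS LocalPV provisioner looks roughly like the sketch below. This is written from memory of the upstream zfs-localpv docs; the pool name and parameters are placeholders, not our planned configuration.

```yaml
# Illustrative sketch based on the openebs/zfs-localpv docs (from memory); not FAF's config.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-zfspv
provisioner: zfs.csi.openebs.io
parameters:
  poolname: "zfspv-pool"   # name of the ZFS pool on the node (placeholder)
  fstype: "zfs"
  compression: "off"
  recordsize: "128k"
```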

"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of is the familiar."
– Benno Rice

Thanks Brutus5000. I'm sure it was a hair-pulling moment.
Mizer

First we will be best, then we will be first.
Never argue with an idiot, they drag you down to their level and then beat you with experience.

Honestly this is not the first time I've heard of an issue with too many open file handles. I'm barely even a Linux admin but occasionally babysit an appliance at work and I've had to increase the max file handles before.

p4block has been using Kubernetes in daily work for ~a year and I have used it for ~4 years and got certified, so we use the tools we know

This makes sense, thanks for the explanation 🙂

Thanks for the explanation. I'm very thankful for the dedication and time you volunteer to keep this awesome game alive. I have no problem with you prioritising your personal time to run the updates, regardless of whether it is peak time or not. As long as it is communicated to the wider community for the weeks and days leading up to the restarts, there should be no issue. Long live FAF!

Thanks for taking the time to write this out! Big fan of transparency like this; mistakes and errors will always happen, and to acknowledge and discuss them like this is very healthy.

I support Clint's suggestion on update timing.

"Design is an iterative process. The necessary number of iterations is one more than the number you have currently done. This is true at any point in time."

Newest map: luminary.png

I appreciate your effort in putting this together! I'm a strong advocate for transparency, as acknowledging and addressing mistakes and errors in such a manner is a commendable practice.

I agree with your proposal regarding the timing of updates.

Yeah, better to fix the issue when it's convenient for you. After all it's just a game, we can live without it for a couple of days if needed.

Skill issue