The fuck happened yesterday - short recap

valigo

Thanks for the postmortem, really appreciate it! Any reason you guys decided to go with k8s at all? I have to deal with it a lot at work and it's a tough bitch to manage, so I'm surprised that an infrastructure as small as FAF's needs it. I wonder if there was any reason besides it being a docker-compose alternative or whatnot.

arma473

@brutus5000 said in The fuck happened yesterday - short recap:

The alternative is, to make these changes when I have free time: at the peak time of faf around 21.00-23.00 CET. This affects more users, but shortens troubleshooting time. What do you think?

Sounds good. It's just a game. If a few people get pushed out sometimes at peak busy time, that's better than having problems linger for 24+ hours at a time.

And if that's what you would prefer, even better. It's important that FAF devs/admins enjoy working on FAF.

Employers compensate for stress by paying people. The best we can offer is just to not make the project more stressful than it needs to be.

Brutus5000

There are plenty of reasons why we're moving to K8s. At first glance it sounds counter-intuitive to run it on a single host. But you've got to start somewhere. Some of the benefits are:

No more fiddling on the server itself. SSHing in and manually copying and modifying files all over the place is dangerous.
GitOps. We commit to the repo, ArgoCD applies it. ArgoCD also syncs differences between what is and what should be, which is a regular issue on our current setup.
Empower more people. If contributors don't need SSH access to the server, which must be restricted because of data protection, we can potentially engage more people to work on it.
Reduce the difference between test and prod config by using Helm templating where useful. E.g. instead of declaring different URLs on test and prod, you just template the base domain, while the rest stays the same.
Zero-downtime updates. On Docker, if I update certain services, e.g. the user service, there is always downtime. With K8s the old instance keeps running until the new one is up. With Docker this can only be done with massive additional work.
Standardization. We have a lot of workflows with custom aliases and stuff for how we deal with Docker that nobody else knows. K8s is operated in a more standardized way, even though we deviate from the standard where we see benefits.
Outlook. Right now we are single-host. In a Docker setup we always will be, because services are declared in the YAMLs, volumes are mounted to a host, etc. With K8s we have the flexibility to move at least some services elsewhere. We're not planning to fix all of our single points of failure, but it will give us more flexibility.
Ecosystem and extensibility. We have gotten the maximum of features out of Docker (Compose). Nothing more will come out of it. In K8s the journey has just begun, and more and more useful stuff is added to it every day: from automated backups of volumes to policy enforcement and virtual clusters for improved security.
Declarative over imperative. Setting up the faf-stack Docker Compose relies heavily on scripts to set up e.g. databases, RabbitMQ users, topics, ... In K8s all major services go declarative using operators. Whether it makes sense to use them for a single use needs to be decided on a case-by-case basis.

The list probably goes on, but you see we have enough reasons.
As for alternatives: we looked into them, but as for Kubernetes, p4block is using it in daily work for ~ a year, I used it for ~4 years and got certification, so we use the tools we know.
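
To make the templating and GitOps points above a bit more concrete, here is a minimal Python sketch. It is not FAF's actual Helm chart or ArgoCD setup: the service names, the domains and both helper functions are made up for illustration. The first function mimics templating only the base domain per environment, the second mimics comparing "what should be" with "what is".

```python
from string import Template

# Hypothetical service URL templates; only the base domain varies per environment.
URL_TEMPLATES = {
    "api": Template("https://api.$base_domain"),
    "user": Template("https://user.$base_domain"),
}


def render_urls(base_domain):
    """Render every service URL for one environment from a single value."""
    return {
        name: template.substitute(base_domain=base_domain)
        for name, template in URL_TEMPLATES.items()
    }


def diff_state(desired, actual):
    """Return {key: (desired, actual)} for every key where the two states differ."""
    keys = desired.keys() | actual.keys()
    return {
        key: (desired.get(key), actual.get(key))
        for key in sorted(keys)
        if desired.get(key) != actual.get(key)
    }


if __name__ == "__main__":
    prod = render_urls("example.com")
    test = render_urls("test.example.com")
    print(prod["api"], test["api"])

    # Simulate somebody hand-editing the live config: this is the drift
    # a GitOps tool would report and sync back to the declared state.
    live = dict(prod, user="https://user.old.example.com")
    print(diff_state(prod, live))
```

A real GitOps controller does much more (ordering, health checks, pruning), but the core idea is the same comparison of declared versus live state.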

Caine

I think you are doing a great job. Thank you for the explanation and for putting your time into troubleshooting.
My opinion is that if, as you described, changes are not made very often, it is a better idea to make them when the server administrators have time for it, even if that is during FAF peak time.
An announcement on Discord that some minor downtime is expected should be enough.

MadMax

imma just nod and pretend like i understood any of this. great work as always. time wise, just do what you can when you can and don't feel pushed, family should come 1st

NOC-

Was the database issue, where it could no longer be found, a MariaDB issue or a network routing issue where routing fuck-ups occurred once it got exposed onto the network, or do we not know?

clint089

Thx for the Post Mortem!

What i would suggest:

Make "critical" changes when you guys have time to monitor/support, even if that means ppl cannot play (once a week to balance it out?)
-- There is always something that needs attention after an update
It would be nice if we got any info about the current state (Discord), some players could start games where others could not and nobody was saying "we know about issues and we are on it, we will post an update in x hours"
Give the http-client a lil love, it's creating so many issues if the server is not responding/behaving as expected.

I noticed that when the replay server was not available, it crashed the client because of an unhandled exception.
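
As an illustration of the kind of handling meant here (not the FAF client's actual code, which this thread does not show), a minimal Python sketch could look like the following. The function name `fetch_or_none` and the URL in the usage line are hypothetical; the point is only that an unreachable server turns into a handled result instead of an uncaught exception.

```python
import http.client
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=3.0):
    """Return the response body, or None if the server is unreachable or misbehaves."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # URLError: refused connections, DNS failures, HTTP error status codes.
        # HTTPException: malformed or cut-off responses.
        # OSError: timeouts and other socket-level errors.
        print(f"request to {url} failed: {exc}")
        return None


if __name__ == "__main__":
    # Nothing should be listening on this port, so this prints a failure message.
    body = fetch_or_none("http://127.0.0.1:1/replay", timeout=1.0)
    print("replay available" if body is not None else "replay server not reachable")
```

The caller can then show a "replay server not available" message instead of crashing.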

PS: You guys still looking for support to migrate?

Brutus5000

@noc said in The fuck happened yesterday - short recap:

Was the database issue, where it could no longer be found, a MariaDB issue or a network routing issue where routing fuck-ups occurred once it got exposed onto the network, or do we not know?

We do not know. But we had errors at OS level saying there were too many open files. That would explain why new connections couldn't be opened while running applications kept theirs alive. It doesn't match whatever happened with IRC, though, because there we got lots of new connections, but they instantly died again.
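
For anyone who wants to see what a "too many open files" situation looks like in numbers, here is a small Python sketch for Linux. It only inspects the process it runs in plus the system-wide counter; whether the per-process or the system-wide limit was hit in this incident is not known, and the function names are made up for this example.

```python
import os
import resource


def process_fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Linux only: one entry per open descriptor
        # (the directory listing itself briefly adds one).
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None
    return open_fds, soft, hard


def system_file_handles():
    """Return (allocated, maximum) system-wide file handles, or None if unavailable."""
    try:
        with open("/proc/sys/fs/file-nr") as f:
            allocated, _unused, maximum = (int(x) for x in f.read().split())
    except (OSError, ValueError):
        return None
    return allocated, maximum


if __name__ == "__main__":
    print("per process (open, soft, hard):", process_fd_usage())
    print("system-wide (allocated, max):", system_file_handles())
```

Once a process reaches its soft limit, calls that need a new descriptor (such as accepting a connection) fail, while descriptors that are already open keep working.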

@clint089 said in The fuck happened yesterday - short recap:

Thx for the Post Mortem!

What i would suggest:

Make "critical" changes when you guys have time to monitor/support, even if that means ppl cannot play (once a week to balance it out?)
-- There is always something that needs attention after an update

The irony here is that both changes independently would have been no-brainers. If you had asked me beforehand, I would've said nothing can go wrong here.

It would be nice if we got any info about the current state (Discord), some players could start games where others could not and nobody was saying "we know about issues and we are on it, we will post an update in x hours"

We mostly try to do that. Unfortunately, in this case all status checks said our services were working. And it takes some time to read the feedback aka "shit's on fire yo" and to also confirm what is actually up.

Give the http-client a lil love, it's creating so many issues if the server is not responding/behaving as expected.

I noticed that when the replay server was not available, it crashed the client because of an unhandled exception.

I have no clue what you mean by http client here, but that sounds like something for the client department.

PS: You guys still looking for support to migrate?

The next milestone is getting ZFS volumes with OpenEBS running. Support on that would be nice, but it's very niche; I doubt we have experts here.

PMBMizer

Thanks Brutus5000. I'm sure it was a hair-pulling moment.
Mizer

RedX

Honestly this is not the first time I've heard of an issue with too many open file handles. I'm barely even a Linux admin but occasionally babysit an appliance at work and I've had to increase the max file handles before.
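
For reference, this is roughly what raising the limit looks like from inside a process, as a Python sketch. The helper name `raise_soft_nofile` is made up, and it only affects the process that calls it. For a service such as a database the limit is usually set outside the process instead, e.g. via systemd's `LimitNOFILE=` or `/etc/security/limits.conf`.

```python
import resource


def raise_soft_nofile(target):
    """Raise the soft open-files limit towards target and return the resulting soft limit.

    The soft limit is never lowered and never pushed above the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft limit before:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
    print("soft limit after: ", raise_soft_nofile(65536))
```

An unprivileged process can raise its soft limit up to the hard limit; raising the hard limit itself needs elevated privileges.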

valigo

using it in daily work for ~ a year, I used it for ~4 years and got certification, so we use the tools we know

This makes sense, thanks for the explanation

Jonnoiscooler

Thanks for the explanation. I'm very thankful for the dedication and time you volunteer to keep this awesome game alive. I have no problem with you prioritising your personal time to run the updates, regardless of whether it is peak time or not. As long as it is communicated to the wider community for the weeks and days leading up to the restarts, there should be no issue. Long live FAF!

IndexLibrorum

Thanks for taking the time to write this out! Big fan of transparency like this; mistakes and errors will always happen, and to acknowledge and discuss them like this is very healthy.

I support Clint's suggestion on update timing.

Khal

I appreciate your effort in putting this together! I'm a strong advocate for transparency, as acknowledging and addressing mistakes and errors in such a manner is a commendable practice.

I agree with your proposal regarding the timing of updates.

Tomma

Yeah, better to fix the issue when it's convenient for you. After all it's just a game, we can live without it for a couple of days if needed.
