Was the database issue, where it could no longer find it, a MariaDB issue or a network routing issue (where routing fuck-ups occurred once it got exposed onto the network), or do we not know?
The fuck happened yesterday - short recap
Thx for the Post Mortem!
What I would suggest:
- Make "critical" changes when you guys have time to monitor/support, even if that means ppl cannot play (once a week to balance it out?)
-- There is always something that needs attention after an update
- It would be nice if we got any info about the current state (Discord), some players could start games where others could not and nobody was saying "we know about issues and we are on it, we will post an update in x hours"
- Give the http-client a lil love, it's creating so many issues if the server is not responding/behaving as expected.
- noticed that when the replay server was not available, it crashed the client bc. of an unhandled exception.
PS: You guys still looking for support to migrate?
@noc said in The fuck happened yesterday - short recap:
Was the database issue, where it could no longer find it, a MariaDB issue or a network routing issue (where routing fuck-ups occurred once it got exposed onto the network), or do we not know?
We do not know. But we had errors at the OS level saying too many open files. That would explain why new connections can't be opened while running applications keep theirs alive. It doesn't match whatever happened with IRC, though, because there we had a lot of new connections, but they instantly died again.
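For anyone who wants to check this on their own box: a minimal sketch, assuming a Linux host with `/proc` mounted, of how a process can compare its open file descriptors against its limit (the same limit `ulimit -n` reports in a shell). The function names and the 0.8 threshold are made up for illustration; this is not what runs on the FAF servers.

```python
import os
import resource  # Unix-only standard library module


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    open_fds is None when /proc is unavailable (for example, not on Linux).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    proc_dir = "/proc/self/fd"
    # Each entry in /proc/self/fd is one open descriptor (files and sockets).
    open_fds = len(os.listdir(proc_dir)) if os.path.isdir(proc_dir) else None
    return open_fds, soft, hard


def near_limit(open_fds, soft, threshold=0.8):
    """True when open descriptors reach `threshold` of a finite soft limit."""
    if open_fds is None or soft == resource.RLIM_INFINITY:
        return False
    return open_fds >= threshold * soft


if __name__ == "__main__":
    used, soft, hard = fd_usage()
    print(f"open={used} soft={soft} hard={hard} near_limit={near_limit(used, soft)}")
```

Sockets count against the same limit as regular files, which is why a process that has hit it can keep its existing connections but fails to accept or open new ones.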
@clint089 said in The fuck happened yesterday - short recap:
Thx for the Post Mortem!
What I would suggest:
- Make "critical" changes when you guys have time to monitor/support, even if that means ppl cannot play (once a week to balance it out?)
-- There is always something that needs attention after an update
The irony here is that both changes, independently, would have been no-brainers. If you had asked me beforehand I would've said nothing can go wrong here.
- It would be nice if we got any info about the current state (Discord), some players could start games where others could not and nobody was saying "we know about issues and we are on it, we will post an update in x hours"
We mostly try to do that. Unfortunately, in this case all status checks said our services were working. And it takes some time to read the feedback, aka "shit's on fire yo", and to also confirm what is actually up.
- Give the http-client a lil love, it's creating so many issues if the server is not responding/behaving as expected.
- noticed that when the replay server was not available, it crashed the client bc. of an unhandled exception.
I have no clue what you mean by http client here, but that sounds like the client department.
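To illustrate the kind of failure described in the quote, here is a rough sketch in Python (not the FAF client's actual code or language) of an HTTP fetch that returns `None` instead of letting an exception escape when the server is down. `fetch_replay` and the URL in the usage line are hypothetical.

```python
import http.client
import urllib.request


def fetch_replay(url, timeout=5.0):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, http.client.HTTPException) as exc:
        # OSError covers refused connections, timeouts, DNS failures and
        # urllib's URLError/HTTPError (both are OSError subclasses).
        print(f"replay server unavailable: {exc!r}")
        return None


if __name__ == "__main__":
    # Hypothetical address; ".invalid" never resolves, so this prints the
    # failure message and then "no replay" instead of crashing.
    data = fetch_replay("http://replay.example.invalid/1", timeout=2.0)
    print("no replay" if data is None else f"{len(data)} bytes")
```

The caller then decides what to show the user when it gets `None`, rather than the whole application going down with the request.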
PS: You guys still looking for support to migrate?
The next milestone is getting ZFS volumes running with OpenEBS. Support on that would be nice, but it's very niche; I doubt we have experts here.
"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of is the familiar."
- Benno Rice
Thanks Brutus5000. I'm sure it was a hair pulling out moment.
Mizer
First we will be best, then we will be first.
Never argue with an idiot, they drag you down to their level and then beat you with experience.
Honestly this is not the first time I've heard of an issue with too many open file handles. I'm barely even a Linux admin but occasionally babysit an appliance at work and I've had to increase the max file handles before.
using it in daily work for ~ a year, I used it for ~4 years and got certification, so we use the tools we know
This makes sense, thanks for the explanation
Thanks for the explanation. I'm very thankful for the dedication and time you volunteer to keep this awesome game alive. I have no problem with you prioritising your personal time to run the updates, regardless of whether it is peak time or not. As long as it is communicated to the wider community for the weeks and days leading up to the restarts, there should be no issue. Long live FAF!
Thanks for taking the time to write this out! Big fan of transparency like this; mistakes and errors will always happen, and to acknowledge and discuss them like this is very healthy.
I support Clint's suggestion on update timing.