DevOps retro: Failure & issue compilation Q4/2020 - Q1/2021


It's been quite some time since my last blog post, so I want to give you a little update on what has happened since then.

A lot of issues cascaded and blocked each other, so in many ways it didn't feel like much progress, because nothing really got finished. But the story behind it might be interesting for you nevertheless.

So here comes probably the longest issue and failure compilation FAF has ever had (and it's just the stuff I remember) šŸ˜‰

Travis CI terminating services
Travis CI is a "continuous integration" platform that we use in almost every GitHub repository to run automated tests and to build releases.

Unfortunately it looks like Travis CI is in financial trouble, and they massively reduced capacity for open source projects. All of a sudden almost every repository had to wait half a day or longer for a single test run or release build, and in many cases even those failed.

As a solution we migrated most of the core repositories to GitHub Actions, which is not only faster but in many cases also nicer to use. Depending on the complexity of the CI pipeline, though, it took multiple evenings until everything was sorted out.

There are still repositories waiting to be migrated šŸ˜ž

GoLive of the matchmaker
Shortly before Christmas we reached the final phase of the matchmaker. In the hurry a few points were overlooked, which caused some trouble:

  1. When publishing the public beta client release, we sort of derped on the version number: it doesn't follow the required version pattern. As a result, every user on this release will never again receive an update notification. Please update manually to the latest client release. (See the version-check sketch after this list.)
  2. Replay ratings: Since the dawn of time, FAF had just one rating per game, because a game was either rated for Global Rating, for Ladder Rating, or not rated at all. Additionally, global and ladder rating summaries were stored in two different tables.
    However, having one table per leaderboard is not a healthy strategy if you plan to have many more leaderboards. Long before TMM we had already moved to dynamic, leaderboard-based rating. Effectively this allows us to calculate more than one rating per game: one rating per leaderboard per game (if we need to). Until the rollout of TMM the old tables were still populated, though. This stopped with the rollout, effectively breaking the rating changes shown in the replay view of the client.
    Also, the API endpoints for fetching the correct replays were still missing and were added on short notice.
  3. The official TMM client version also had issues showing the right rating for the in-game players. This led to maximum confusion about whether the whole system worked and whether the ratings were actually calculated accurately. In the end I'd say it was all correct, just the display was wrong.
  4. The replay vault search by rating was removed due to a misunderstanding between Sheikah and me. Things got complicated here as well: there is no longer the one rating to search for; instead you have to select which leaderboard's ratings you want to look at.
  5. Instead of looking up each leaderboard with multiple queries, we tried to add the current leaderboard ratings for each player to the player object at API level.
    Unfortunately this created a suboptimal query that maxed out MySQL CPU usage within minutes. This led to a few server crashes until we figured out what had happened.
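
Regarding point 1, here is a minimal, hypothetical sketch of why a version string that doesn't match the expected pattern silently disables update notifications. The pattern and the comparison logic are assumptions for illustration, not the client's actual code:

```java
import java.util.regex.Pattern;

public class VersionCheck {

    // Assumed pattern: plain MAJOR.MINOR.PATCH like "1.4.1".
    private static final Pattern VERSION = Pattern.compile("\\d+\\.\\d+\\.\\d+");

    static boolean matchesPattern(String version) {
        return VERSION.matcher(version).matches();
    }

    static boolean isNewer(String remote, String local) {
        String[] r = remote.split("\\.");
        String[] l = local.split("\\.");
        for (int i = 0; i < 3; i++) {
            int diff = Integer.parseInt(r[i]) - Integer.parseInt(l[i]);
            if (diff != 0) {
                return diff > 0;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String installed = "1.4.1-beta";   // hypothetical "derped" version string
        String latest = "1.4.2";

        // If the installed version doesn't match the pattern, the comparison is
        // skipped and the user never sees an update notification.
        if (matchesPattern(installed) && isNewer(latest, installed)) {
            System.out.println("Update available: " + latest);
        } else {
            System.out.println("No update notification shown");
        }
    }
}
```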

Moving to MariaDB
The MySQL issue showed us once again that we want to move to a newer database version. The current MySQL 5.7 is quite old.
This discussion is not new; actually it's almost 2 years old, but it got deferred because there was no real pressure (and we couldn't really agree on whether to go to MySQL 8 or MariaDB 10.x, which are both compatible with MySQL 5.7 but incompatible with each other).

The server downtime caused by the suboptimal query described above made me reconsider. We had run into multiple issues with MySQL's shitty way of executing a specific type of join (nested loops). So we tested this with MySQL 8 and MariaDB 10.5 on a copy of the production database.

There was a simple query that took 1 min 25 s (utilising the whole PC). MySQL 8 was just 5 seconds faster. MariaDB, however, has obviously fixed the underlying issue and returned a result in 1 second.

So this set the new path: migrate to MariaDB. A test migration from the raw MySQL 5.7 data files without backup and restore ended up in a corrupted data set, so we'll need to do a proper backup and restore.

Unfortunately MariaDB tripped over one of the migration files when creating the current database schema. Therefore I tried to merge all migrations into a single database schema import (which took a few days to figure out).

Once this was done I still encountered the error, now with a better description, only to find out that it was caused by a bug in MariaDB that was already marked as fixed but not released yet! So it had to be postponed once again...

By now the fix is finally released, and we'll need to redo some testing. This will probably be part of the next server update.

Auto-Backups killed the server
Our new server uses ZFS as its file system, which is very modern and powerful. When we set up the server we thought using ZFS for the Docker container volumes (Docker being the container daemon that runs all our apps) was a good idea. It wasn't: it polluted the system with a ZFS dataset for every Docker container layer (there were HUNDREDS of them). Even though this caused no direct harm, it prevented us from activating auto-backups.

With the last server update we fixed this and are now back to a proper dataset list:

NAME            USED  AVAIL     REFER  MOUNTPOINT
tank            888G  34.4G     3.93G  legacy
tank/docker    32.7G  34.4G     29.1G  -
tank/fafstack   160G  34.4G      158G  legacy
tank/mongodb    353M  34.4G      277M  legacy
tank/mysql     13.5G  34.4G     12.4G  legacy
tank/replays    677G  34.4G      677G  legacy

And then we thought activating auto-backups was a good idea. It was not. Even though snapshots should only consume the "delta" between the current and the previous state on disk, MySQL alone took ~20 GB for 5 snapshots. This brought the server disk into a critical area, where it started mounting parts of the file system as read-only.

The solution was to remove the backup snapshots, restart the server, and work on the next topic:

Replay recompression
When we rented the new server in January 2020 we thought about how much disk space we would need and what the additional cost for more storage would be. Back then we came to the conclusion that 1 TB would suffice, and that once we ran into problems we'd optimize the replays stored on disk.

The replay format dates back to ancient times, when some weird decisions were made. A replay file consists of a single line of JSON describing the content of the replay. After the line break follows the actual replay data, deflate-compressed and then base64-encoded.

The deflate compression algorithm is ancient and very inefficient compared to modern algorithms.
Base64 encoding is a way to force arbitrary binary data into a representation using only the ASCII character set. This might have made sense in 2013 for download compatibility, but I highly doubt it. The massive disadvantage is that base64 encoding adds an overhead of around 30%.
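
To make the legacy format a bit more tangible, here is a minimal sketch of reading such a file in Java. The file name is made up, and it assumes the deflate stream is zlib-wrapped (a raw deflate stream would need `new Inflater(true)`):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.zip.InflaterInputStream;

public class LegacyReplayReader {

    public static void main(String[] args) throws Exception {
        String raw = Files.readString(Path.of("12345678.fafreplay"), StandardCharsets.UTF_8);

        int lineBreak = raw.indexOf('\n');
        String jsonHeader = raw.substring(0, lineBreak);   // metadata: title, players, mod, ...
        String base64Body = raw.substring(lineBreak + 1);

        // Undo the two legacy layers: base64 first, then deflate.
        byte[] compressed = Base64.getMimeDecoder().decode(base64Body);
        byte[] replayStream = new InflaterInputStream(new ByteArrayInputStream(compressed)).readAllBytes();

        System.out.println(jsonHeader);
        System.out.println("Raw replay stream: " + replayStream.length + " bytes");
    }
}
```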

Improving this replay format is, again, a topic that is well known, older than 2 years, and was always scheduled as "can be done when needed", even though code changes were already lying around for the replay server and the Java client.

The new plan: instead of deflate + base64 we will compress the replay stream with the Zstandard algorithm developed by Facebook (one of the best compression algorithms around).

Actually, the latest client release 1.4.1 already added support for the new replay format, as we could see the server disk filling up (but we expected to have more time).

With the server crash due to disk space we reprioritized this issue. Now, with the client release out, we have prepared a script to re-compress all existing replays to Zstandard (currently I'm doing some last backups, but I intend to start it tomorrow).
On test excerpts we saw a reduction of around 40% in disk space. This will free up at least 200 GB on the server, making room for another 2-4 years of new replays.
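
The core of the recompression step could look roughly like this. It is a sketch under assumptions: it uses the zstd-jni library, keeps the JSON header line as-is, and guesses the compression level; the actual script and target file layout may differ:

```java
import com.github.luben.zstd.ZstdOutputStream;   // assumption: zstd-jni on the classpath

import java.io.ByteArrayInputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.zip.InflaterInputStream;

public class ReplayRecompressor {

    // Re-encodes a single legacy replay: the JSON header stays a plain text line,
    // the body goes from base64(deflate(stream)) to zstd(stream).
    static void recompress(Path source, Path target) throws Exception {
        String raw = Files.readString(source, StandardCharsets.UTF_8);
        int lineBreak = raw.indexOf('\n');
        String jsonHeader = raw.substring(0, lineBreak + 1);

        byte[] compressed = Base64.getMimeDecoder().decode(raw.substring(lineBreak + 1));
        byte[] replayStream = new InflaterInputStream(new ByteArrayInputStream(compressed)).readAllBytes();

        try (OutputStream out = Files.newOutputStream(target)) {
            out.write(jsonHeader.getBytes(StandardCharsets.UTF_8));
            try (ZstdOutputStream zstd = new ZstdOutputStream(out, 10)) {   // level 10 is a guess
                zstd.write(replayStream);
            }
        }
    }
}
```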

Current estimates say it will take about 10 days if we run it without parallelization.

The OAuth cascade of Doom
In my last blog post I presented a new registration page. Unfortunately, when trying to finish it, I hit a wall on logging in to our API. It took me over 2 months to grasp the whole complexity of the problem.

When we started work on the FAF API we selected OAuth2 as a standardized, mature and safe protocol for web login. That is sort of still the case. We then built a whole authorization server into our API that takes care of the full set of login flows. Currently we use 2 of these: if you log in via the website or the forum, you get redirected to an API login page and after logging in you are redirected back; this is called the "implicit flow". The Java client, however, uses a classic username-and-password approach; this is known as the "password flow".
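
For context, this is roughly what the "password flow" looks like on the wire: the client posts the username and password directly to the token endpoint and receives an access token back. The endpoint URL, client id and credentials below are made up; this is only a sketch of the standard request, not FAF's actual API:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class PasswordFlowExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical token endpoint and client id.
        String tokenEndpoint = "https://api.example.faforever.com/oauth/token";

        String form = "grant_type=password"
                + "&client_id=faf-client"
                + "&username=" + URLEncoder.encode("SomePlayer", StandardCharsets.UTF_8)
                + "&password=" + URLEncoder.encode("secret", StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder(URI.create(tokenEndpoint))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();

        // The response body contains the access token as JSON,
        // e.g. {"access_token": "...", "token_type": "bearer", ...}
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```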

Following the standards and using well-known frameworks, I expected to be able to connect to the API from my registration app very easily. But it didn't work out. Today I know: this is related to the fact that the concrete OAuth2 implementation we use is customized in a non-standard way (which I wasn't even aware of). There is a better way of doing it now, with a newer standard on top of OAuth2 called OpenID Connect.

Unfortunately the library our API uses for this is deprecated, and its successor dropped support for the authorization server altogether. So eventually a replacement was needed.

I looked into the space of cloud offerings to see if there are any free services that could work for FAF. But as it turns out, FAF is huuuge with its 250,000+ player base. Running this as a managed cloud service would cost us hundreds, maybe thousands of euros per month. (Currently we spend just ~80ā‚¬ in total on the production server.)

So we needed something self-hosted. Two candidates were evaluated: Red Hat Keycloak and Ory Hydra.

Red Hat Keycloak is the more mature product, so I evaluated it first. After a week of experiments it turned out to be completely unsuitable for FAF, as it is built for enterprise use cases where you e.g. have to provide your first and last name and can't freely select a username (or change it later).

Ory Hydra, on the other hand, is just one tool that does OAuth2 flows and nothing else; it needs to be combined with your own software. So I wrote that software, with a fancy new login and everything (that's a whole story on its own), and after 2 months and a shitton of failures, mostly related to fighting http vs https in an internal network, I was able to log in to the test server.

But then I was stuck, because the API needs to support both the old and the new way of logging in: the tokens look different, the cryptographic signatures are different, and so on. So I wrote a compatibility layer to support both.
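
Conceptually, the compatibility layer just has to detect which kind of token it received and pick the matching signature check. A very simplified sketch (the issuer URL and the detection rule are made up, and real code would use a proper JWT library and verify the signature with the matching key):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class TokenCompatibilityLayer {

    // Hypothetical issuer value; the real tokens identify themselves differently.
    private static final String HYDRA_ISSUER = "https://hydra.test.faforever.com/";

    enum TokenKind { LEGACY_API, ORY_HYDRA }

    // Inspect the (not yet verified) JWT payload to decide which validation path to use.
    static TokenKind classify(String jwt) {
        String[] parts = jwt.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Not a JWT");
        }
        String payload = new String(Base64.getUrlDecoder().decode(parts[1]), StandardCharsets.UTF_8);
        return payload.contains("\"iss\":\"" + HYDRA_ISSUER + "\"")
                ? TokenKind.ORY_HYDRA
                : TokenKind.LEGACY_API;
    }
}
```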

But now we're stuck again. Ory Hydra takes security very seriously, and because of that they refuse to implement deprecated legacy OAuth flows such as the "password flow", which the FAF client uses...
So we need to implement a web-based login. But there is hardly any library support out there, because the open source world moved away from desktop applications a long time ago.

So in theory we need to build a web-based login using a WebView and catch the responses to extract the access tokens generated there. But on the other side we still need to log in the classic way, because the server still uses the old username/password to check the login itself... there are plans and suggested code changes to fix that as well, but...
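
A rough sketch of what that could look like in the JavaFX client: load the authorization page in a WebView, watch the navigation, and grab the authorization code once the browser is redirected back to a local callback. All URLs and parameters here are made up for illustration, and the follow-up step of exchanging the code for tokens is left out:

```java
import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.web.WebView;
import javafx.stage.Stage;

public class WebViewLoginSketch extends Application {

    // Hypothetical authorization URL and redirect URI.
    private static final String REDIRECT_URI = "http://localhost:57728/callback";
    private static final String AUTHORIZE_URL =
            "https://hydra.example.faforever.com/oauth2/auth"
            + "?response_type=code&client_id=faf-client"
            + "&redirect_uri=" + REDIRECT_URI + "&scope=openid+offline";

    @Override
    public void start(Stage stage) {
        WebView webView = new WebView();

        // Watch every navigation; once we land on the callback URI,
        // the authorization code is in the query string.
        webView.getEngine().locationProperty().addListener((obs, oldUrl, newUrl) -> {
            if (newUrl != null && newUrl.startsWith(REDIRECT_URI)) {
                String code = newUrl.replaceAll(".*[?&]code=([^&]+).*", "$1");
                System.out.println("Authorization code: " + code);
                stage.close();
                // Next step (not shown): exchange the code for tokens at the token endpoint.
            }
        });

        webView.getEngine().load(AUTHORIZE_URL);
        stage.setScene(new Scene(webView, 450, 600));
        stage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}
```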

Summary: it's all incompatible with each other, and a transitional approach seems almost impossible. We'd need a big-bang go-live that deprecates all older clients, with the risk that nobody can log in to FAF for quite some time.

This is hell and there is no easy way out, but there is no alternative in the long run. And we can't just wait for the next explosion to pressure us šŸ˜ž

"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of is the familiar."
ā€“ Benno Rice

Wow! Thanks for this post Brutus, it's good to know about the Atlases who keep FAF up and their daily toils; otherwise one would assume these toils don't exist.

It seems like a big revamp of both the client and the server is in order.

But how to solve this?

If I recall correctly, making the FAF client web-based was not an option.

And the right lib for Java? That's never happening.

In 2021 Java is a ghost town.

Given that I earn my money in the Java ecosystem, I can assure you it's not a ghost town. It's very alive on the server side. It's also still growing on mobile (Android). Our UI toolkit is still under development as well.

But "the big players" (as in "the big frameworks") aren't focusing on desktop web applications anymore, because they are a hassle to deploy etc and there are better solutions. We can still build or own solution. But it requires extra attention because authorization is such a critical topic.

With regard to web-based FAF clients: in 2015 it wasn't an option due to the architecture of FAF. The Java client enforced some good developments in the right direction (e.g. a REST-based API instead of TCP-based protocols and PHP sites scattered around). The server has also made some smaller steps in that direction (e.g. you can now connect using UTF-8 plain text in addition to the UTF-16 QString protocol). Still, you can't just make all the client developers shift to web development instead: other languages, other paradigms, a different ecosystem.

"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of is the familiar."
ā€“ Benno Rice

This is great reading. Especially about the auth woes. Makes you appreciate just how much unpaid time is put in.