Contribution Guidelines

Brutus5000

Hello everyone,

in the last week we had two major events with key persons of FAF history stepping out of the picture.

Last Sunday on the first regular general meeting of the FAF association we elected a new board. With this election Sheeo effectively steps down from running FAF business for the assocation as a president/board member and previously from the FAForever LLC after more than half a decade.

And today we finally transferred the last (but super import) asset of FAForever: the domain ownership and DNS access. As my dashboard is showing me so nicely I can tell you that Visionik took over the domain from ZePilot on 28th October 2014. For multiple years he invested a lot of time and money to keep things running while slowly stepping out of operational business.

On behalf of the FAForever association and the FAF community I'd like to say: Thank you for your engagement over such a big period of time. Without you FAForever wouldn't be what it is today.

We promise not to break it.

Brutus5000

Contribution Guidelines

Version from 18.03.2022

Introduction

These contribution guidelines apply to all spaces managed by the FAForever project, including IRC, Forum, Discord, Zulip, issue trackers, wikis, blogs, Twitter, YouTube, Twitch, and any other communication channel used by contributors.

We expect these guidelines to be honored by everyone who contributes to the FAForever community formally or informally, or claims any affiliation with the Association and especially when representing FAF, in any role.

These guidelines are not exhaustive or complete. They serve to distill our common understanding of a collaborative, shared environment and goals. We expect them to be followed in spirit as much as in the letter, so that they can enrich all of us and the technical communities in which we participate.
They may be supplemented by further rules specifying desired and undesired behaviour in certain areas.

The Rules of the FAForever community also apply to contributors.

Specific Guidelines

We strive to:

Be open. We invite anyone to participate in our community. We preferably use public methods of communication for project-related messages, unless discussing something sensitive. This applies to messages for help or project-related support, too; not only is a public support request much more likely to result in an answer to a question, it also makes sure that any inadvertent mistakes made by people answering will be more easily detected and corrected.
Be empathetic, welcoming, friendly, and patient. We work together to resolve conflict, assume good intentions, and do our best to act in an empathetic fashion. We may all experience some frustration from time to time, but we do not allow frustration to turn into a personal attack. A community where people feel uncomfortable or threatened is not a productive one. We should be respectful when dealing with other community members as well as with people outside our community.
Be collaborative. Our work will be used by other people, and in turn we will depend on the work of others. When we make something for the benefit of the project, we are willing to explain to others how it works, so that they can build on the work to make it even better. Any decision we make will affect users and colleagues, and we take those consequences seriously when making decisions.
Be inquisitive. Nobody knows everything! Asking questions early avoids many problems later, so questions are encouraged, though they may be directed to the appropriate forum. Those who are asked should be responsive and helpful, within the context of our shared goal of improving the FAForever project.
Be careful in the words that we choose. Whether we are participating as professionals or volunteers, we value professionalism in all interactions, and take responsibility for our own speech. Be kind to others. Do not insult or put down other participants. Harassment is not acceptable.
Be concise. Keep in mind that what you write once will be read by dozens of persons. Writing a short message means people can understand the conversation as efficiently as possible. When a long explanation is necessary, consider adding a summary.
Try to bring new ideas to a conversation so that each message adds something unique to the thread, keeping in mind that the rest of the thread still contains the other messages with arguments that have already been made.
Try to stay on topic, especially in discussions that are already fairly large.
Step down considerately. Members of every project come and go. When somebody leaves or disengages from the project they should tell people they are leaving and take the proper steps to ensure that others can pick up where they left off. In doing so, they should remain respectful of those who continue to participate in the project and should not misrepresent the project's goals or achievements. Likewise, community members should respect any individual's choice to leave the project.

Diversity Statement

We welcome and encourage participation by everyone. We are committed to being a community that everyone feels good about joining. Although we may not be able to satisfy everyone, we will always work to treat everyone well.

No matter how you identify yourself or how others perceive you: we welcome you. Though no list can hope to be comprehensive, we explicitly honour diversity in: age, gender identity and expression, sexual orientation, neurotype, race, religion, nationality, culture, language, socioeconomic status, profession and technical ability.

Though we welcome people fluent in all languages, all official FAForever communication is conducted in English. Translations may be provided, but in case of contradictory wording, the English version takes precedence.

Reporting Guidelines

While these guidelines should be adhered to by contributors, we recognize that sometimes people may have a bad day, or be unaware of some of the guidelines in this document. If you believe someone is violating these guidelines, you may reply to them and point them out. Such messages may be in public or in private, whatever is most appropriate. Assume good faith; it is more likely that participants are unaware of their bad behaviour than that they intentionally try to degrade the quality of the discussion. Should there be difficulties in dealing with the situation, you may report your compliance issues in confidence to either:

The president of the FAForever association: [email protected]
Any other board member of the association as listed in our forum.

If the violation is in documentation or code, for example inappropriate word choice within official documentation, we ask that people report these privately to the project maintainers or to the DevOps Team Lead.

Endnotes

This statement was copied and modified from the Apache Software Foundation Code Of Conduct and it’s honoured predecessors.

Brutus5000

After nearly a decade of relentless pursuit—researching, begging, bribing, schmoozing, annoying, plotting, and exchanging more favors than we care to admit—we’ve finally done it. The impossible has become reality: We have the source code for Supreme Commander: Forged Alliance.

Not only that, but we’ve already cleaned it up and made it compatible with modern systems, running smoothly on Windows 10+ and (of course) Linux.

Starting today, the code is publicly available on GitHub:
https://github.com/FAForever/SCFA_source

Of course, as expected, this comes with a few caveats. We do not have copyright permissions for the game assets, nor do we have official authorization to use the brand. However, this breakthrough still unlocks unprecedented opportunities: we can now fix engine-level bugs, expand the game’s capabilities beyond anything we’ve ever imagined, and push FAF into a new era.

This is more than we ever hoped for. The future of FAF has never looked brighter.

Brutus5000

It's been almost a year since my last update.

Unfortunately apart from some initial progress with adding a web-based UI not much happened to the ICE adapter repo. I took multiple approaches to de-spaghetti the code, but still I couldn't even make sense of what I'm trying to change there. And everytime I tried to refactor something I ended up in modules that do not even belong there.

The lack of any kind of automated tests and the lack of the ability to run 2 ice adapters in parallel from the same IDE made it impossible for me to gain any progress.

So a month ago I started with a new approach. I tried to use the Ice4J library (the fundamental library that the ICE adapter is built around) in a standalone project and tried to figure out how to use it. Also ChatGPT was a big help, as it creates better docs than the original authors...

Starting from this and then decomposing the ice adapter classes I could iteratively figure out how the ice adapter actually works. So with each iteration I drew my learnings into a diagram until I had a good overview over how it actually works. This can be found here.

Based on these insights I started to slowly build up a brand new, cleaner implementation of the ice adapter in Kotlin. It's far from done yet, but it already has an integration test that connects 2 ice adapters locally, which proves me that this is possible.

The code was published today and can be found in this new repository.

The next step is to achieve functional parity with the java implementation. This might take a few more months though. So stay tuned for the update next year

Brutus5000

Hello there,

recently I got asked a lot why we can't solve FAFs infrastructure related problems and outages with investing more money into it. This is a fair question, but so far I backed out of investing the time to explain it.

Feel free to ask additional questions, I probably won't cover everything in the first attempt. I might update this post according to your questions.

The implications of an open source project

When Vision bought FAF from ZePilot, he released all source code under open source licenses for a very good reason: Keeping FAF open source ensures that no matter what happens, in the ultimate worst case, other people can take the source code and run FAF on their own.

The FAF principles

To keep this goal, we need to follow a few principles:

All software services used in FAF must be open source, not just the ones written for FAF, but the complementary ones as well. => Every interested person has access to all pieces.
Every developer should be capable to run FAF core on a local machine. => Every interested person can start developing on standard hardware.
Every developer should be capable of replicating the whole setup. => Should FAF ever close down someone else can run a copy of it without needing a master’s degree in FAFology or be a professional IT SysAdmin.
The use of external services should be avoided as they cost money and can go out of business. => Every interested person can run a clone of FAF on every hoster in the world.

Software setup

Since the beginning, but evermore growing, the FAF ecosystem has expanded and included many additional software pieces that interact with what I call the "core components".

As of now running a fully fledged FAF requires you to operate 30 (!!) different services. As stated earlier due to our principles every single one of these is open source too.

As you can imagine this huge amount of different services serves very different purposes and requirements. Some of them are tightly coupled with others, such as IRC is an essential part of the FAF experience, while others can mostly run standalone like our wiki. For some purposes we didn't have any choice what to use, but are happy that there is at least one software meeting our requirements. Some software services were build in a time, when distributed systems, zero-downtime maintenance and high-availability weren't goals for small non-commercial projects. As a result they barely support such "modern" approaches on running software.

Current Architecture

The simple single

FAF was started as a one-man project as a student hobby project. It's main purpose was to keep a game alive that was about to be abandoned by its publisher. At that time nobody imagined that 300k users might register there, that 10 million replays would be stored or that 2100 user login at the same time.

From the core FAF was build to run on a single machine, with all services running as a single monolithic instance. There are lots of benefits there:

From a software perspective this simplifies a few things: Network failures between services are impossible. Latency between services is not a problem. All services can access all files from other services if required. Correctly configured there is not much that can happen.
It reduces the administration effort (permissions, housekeeping, monitoring, updates) to one main and one test server.
There is only one place where backups need to be made.
It all fits into a huge docker-compose stack which also achieves principles #2 and #3.
Resource sharing:
** "Generic" services such as MySQL or RabbitMQ can be reused for different use cases with no additional setup required.
** Overall in the server: Not all apps take the same CPU usage at the same time. Having them on the same machine reduces the overall need for CPU performance.
Cost benefits are huge, as we have one machine containing lots of CPU power, RAM and disk space. In modern cloud instances you pay for each of these individually.

In this setup the only way to scale up is by using a bigger machines ("vertical scaling"). This is what we did in the past.

Single point of failures everywhere

A single point of failure is a single component that will take down or degrade the system as a whole if it fails. For FAF that means: You can't play a game.

Currently FAF is full of them:

If the main server crashes or is degraded because of an attack basically or other reasons (e.g. disk full) all services become unavailable.
If Traefik (our reverse proxy) goes gown, all web services go down.
If the MySQL database goes down, the lobby server, the user service, the api, the replay server, wordpress (and therefore the news on the website), the wiki, the irc services go down as well.
If the user service or Ory Hydra goes down, nobody can login with the client.
If the lobby server goes down, nobody can login with the client or play a game.
If the api goes down, nobody can start a new game or login into some services, also the voting goes down.
If coturn goes down, current games die and no new games can be played unless you can live with the ping caused by the roundtrip to our Australian coturn server.
If the content server goes down, the clients can't download remote configuration and the client crashes on launch. Connected players can't download patches, maps, mods or replays anymore.

At first, the list may look like an unfathomable catalogue of risks, but in practice the risk on a single server is moderate - usually.

Problem analysis

FAF has many bugs, issues and problems. Not all of them are critical, but for sure all of them are annoying.

Downtime history

We had known downtimes in the past because of:

Global DNS issues
Regular server updates
Server performance degrading by DoS
Server disk running full
MySQL and/or API overloaded because of misbehaving internal services
MySQL/API overload because of bad caching setup in api and clients
MySQL and/or API overload because too many users online

The main problem here, is to figure out what actually causes a downtime or degradation of service. It's not always obvious. Especially the last 3 items look pretty much the same from a server monitoring perspective.

The last item is the only one that can be solved by scaling to a more to a more juicy server! And even than in many cases it's not required but can be avoided by tweaking the system even more.

I hope it comes clear that just throwing money at a more juicy server can only prevent a small fraction of downtime reason from happening.

Complexity & Personnel situation

As aforementioned, FAF runs on a complex setup of 30 different services on a docker compose stack on a manually managed setup.
On top of that we also develop our own client using these services.

Let's put this into perspective:
I'm a professional software engineer. In the last 3 years I worked on a b2b product (=> required availability Mo-Fr 9 to 5, excluding public holidays) that has 8 full time devs, 1 full time linux admin / devops expert and 2 business experts and several customer support teams.

FAF on the other hand has roughly estimated 2-3x the complexity of said product. It runs 24/7 and special attention on public holidays. It uses

We have no full time developers, we have no full time server admins, we have no support team. There are 2 people with server access who do that in their free time after their regular work. There are 4-5 recurring core contributors working on the backend and a lot more rogue ones who occasionally add stuff.

(Sometimes it makes me wonder how we ever got this far.)

Why hiring people is not an option

It seems obvious that if we can't solve problems by throwing money for a better server, then we maybe should throw money at people and start hiring them.

This is a multi-dimensional problem.

Costs

I'm from Germany and therefore I know that contracting an experienced software developer as a freelancer from west-europe working full time costs between 70k-140k€ per year.
That would be 5800€-11500€ a month. Remember: Our total Patreon income as of now is roundabout 600€.

So let's get cheaper. I'm working with highly qualified romanian colleagues and they are much cheaper. You get the same there for probably 30k € per year.
Even a 50% part time developer would exceed the Patreon income twice with 1250€ per month.

Skills and motivation

Imagine you are the one and only full time paid FAF developer. You need a huge skillset to work on everything:

Linux / server administration
Docker
Java, Spring, JavaFX
Python
JavaScript
C#
Bash
SQL
Network stack (TCP/IP, ICE)
All the weird other services we use

This takes quite a while and at the beginning you won't be that productive. But eventually once you master all this or even half of it, your market value has probably doubled or tripled and now you can earn much more money.

If you just came for the money, why would you stay if you can now earn double? This is not dry theory, for freelancers this is totally normal and in the regions such as east-europe this is also much more common for regularly paid employees.

So after our developer went through the pain and learned all that hard stuff, after 2,3 years he leaves. And the cycle begins anew.

Probably no external developer will ever have the intrinsic motivation to stay at FAF since there is no perspective.

Competing with the volunteers

So assume we hired a developer and you are a senior FAF veteran.

Scenario 1)

Who's gonna teach him? You? So he get's paid and you don't? That's not fair. I leave.

This is the one I would expect from the majority of contributors (even myself). I personally find it hard to see Gyle earning 1000€ a month and myself getting nothing (to be fair I never asked to be paid or opened a Patreon so there is technically no reason to complain).

Scenario 2)

Woah I'm so overworked and their is finally someone else to do it. I can finally start playing rather doing dev work.

In this scenario while we wouldn't drive contributors away, the total work done might still remain the same or even decline.

Scenario 3)

Yeah I'm the one who got hired. Cool. But now it's my job and I don't want to lose fun. Instead of working from 20:00 to 02:00 I'll just keep it from 9 to 5.

This is what would probably happen if you hired me. You can't work on the same thing day and night. In total you might invest a little more time, but not that much. Of course you are still more concentrated if it's your main job rather then doing it after your main job.

Who is the boss?

So assume we hired a developer. Are the FAF veterans telling the developer what to do? Or is the developer guiding the team now? Everybody has different opinions, but now one dude has a significant amount of time and can push through all the changes and ignore previous "gentlemen agreements".

Is it the main task to merge the pull requests of other contributors? Should he work only on ground works?

This would be a very difficult task. Or maybe not? I don't know.

Nobody works 24/7

Even if you hire one developer, FAF still runs 24/7. So there is 16 hours where theres no developer available. A developer is on vacation and not available during public holidays.

One developer doesn't solve all problems. But he creates many new ones.

Alternative options

So if throwing money at one server doesn't work and hiring a developer doesn't work, what can we do with our money?
How about:

Buying more than one server!
Outsource the complexity!

How do big companies achieve high availability? They don't run stuff critical parts on one server, but on multiple servers instead. The idea behind this is to remove any single point of failures.

First of all Dr. Google tells you how that works:

Dr. Google: Instead of one server you have n servers. Each application should run on at least two servers simultaneously, so the service keeps running if one server or application dies.

You: But what happens if the other server or application dies too?

Dr. Google: Well in best case you have some kind of orchestrator running that makes sure, that as soon as one app or one server dies it is either restarted on the same server or started on another server.

You: But how do I know if my app or the server died?

Dr. Google: In order to achieve that all services need to offer a healthcheck endpoint. That is basically a website that the orchestrator can call on your service, to see if it is still working.

You: But now domains such as api.faforever.com need to point to two servers?

Dr. Google: Now you need to put in a loadbalancer. That will forward user requests to one of the services.

You: But wait a second. If my application can be anywhere, where does content server read the files from? Or the replay server write them to?

Dr. Google: Well, in order to make this work you no longer just store it on the disk of the server you are running on, but on a storage place in the network. It's called CEPH storage.

You: But how do I monitor all my apps and servers now?

Dr. Google: Don't worry there are industry standard tools to scrape your services and collect data.

Sounds complicated? Yes it is. But fortunately there is an industry standard called Kubernetes that orchestrates service in a cluster on top of Docker (reminder: Docker is what you use already).

You: Great. But wait a second, can Kubernetes you even run 2 parallel databases?

Dr. Google: No. That's something you need to setup and configure manually.

You: But I don't want to do that?!

Dr. Google: Don't worry. You can rent a high-available database from us.

You: Hmm not so great. How about RabbitMQ?

Dr. Google: That has a high availability mode, it's easy to setup it just halves your performance and breaks every now and then... or you just rent it from our partner int the marketplace!

You: Ooookay? Well at least we have a solution right? So I guess the replay server can run twice right?

Dr. Google: Eeerm no. Imagine your game has 4 players and 2 of them end up on the one replay server and 2 end up on the other. So why don't you store 2 replays?

You: Well that might work hmm no idea. But the lobby server. That will work right?

Dr. Google: No! The lobby server keeps all state in memory. Depending on which one you connect to, you will only see the games on that one. You need to put the state into redis. Your containers need to be stateless! And don't forget to have a high available redis!

You: Let me guess: I can find it on your marketplace?

Dr. Google: We have a fast learner here!

You: Tell me the truth. Does clustering work with FAF at all?

Dr. Google: No it does not, at least not without larger rewrites of several applications. Also many of the apps you use aren't capable of running in multiple instances so you need to replace them or enhance them. But you want to reduce downtimes right? Riiiight?

-- 4 years later --

You: Oooh I finally did that. Can we move to the cluster now?

Dr. Google: Sure! That's 270$ a month for the HA-managed MySQL database. 150$ for the HA-managed RabbitMQ, 50$ for the HA-managed redis, 40$ a month for the 1TB cloud storage and 500$ for the traffic it causes. Don't forget your Kubernetes nodes, you need roundabout 2x 8 cores for 250$, and 2* 12GB RAM for 50$, 30$ for the stackdriver logging. That's at total of 1340$ per month. Oh wait you also needed a test server...??

You: But on Hetzner I only paid 75$ per month!!

Dr. Google: Yes but it wasn't managed and HIGHLY AVAILABLE.

You: But then you run it all for me and it's always available right?

Dr. Google: Yes of course... I mean... You still need to size your cluster right, deploy the apps to kubernetes, setup the monitoring, configure the ingress routes, oh and we reserve one slot per week where we are allowed to restart the database for upd... OH GOD HE PULLED A GUN!!

Brutus5000

This article describes a project around the FAF ICE adapter. If you have no clue what it is I'd recommend to read this blog post first.

Some of you might think now: "Dude, why do you have time to blog? We have bigger problems! Fix the gmail issue right now!" Unfortunately my life is very constrained due to my very young children, so sometimes I can work on little projects but can't tackle server side things.

Why now?

After staying away from it successfully I recently started digging into the inner workings of the ICE adapter. There are plenty of reasons that led to this decision:

The ICE adapter is a critical part of the infrastructure. But only the original author knows the code base and claimed it to be in a bad shape (which after thorough analysis can agree to). Both are basically unacceptable facts for the long-running health of FAF.
We tried adding more coturn servers to improve the situation on non-Europe continents, but we're facing some issues that could be best solved in the ICE adapter itself.
The previous author of the ICE adapter has a serious lack of time to implement features.
The ICE adapter still relied on Java 8 (the last release of Java where the JavaFX ui libraries where bundled to the Java release), but all other pieces are already on Java 17 (!). Right now it only works in the Java client due to some dirty hacks.

Constraints

The ICE adapter is a very fragile piece of software as we have learnt with some attempted changes that required a rollback to previous state more than once. The problem here is that even with intensive testing in the past with 20+ users, we still encountered users with issues that never occured during testing. Every computer, every system, every user has a different setup of hardware, operating system (and patch level), configuration, permissions, anti virus, firewall setups, internet providers and so on.

Every single variable can break connectivity and we will never know why.
This led to the point that the fear of breaking something pushed us back from adding potential improvements.

Analysis

Before I started refactoring I went through the code to gain a better understanding and noticed a few points:

the release build process still relied on Travis CI which no longer works
many libraries the ice adapter is built on are outdated
we forked some libraries, made some changes and then never kept up with the upstream changes
some code areas reinvent the wheel (e.g. custom argument parsing)
ice adapter state is shared all over the application in static variables with no encapsulation
a lot threads are spawned for simple tasks
a lot of thread synchronization is going on as every thread can (and also does) modify the static (global) state

Almost none of this is related to actual ICE / networking logic. So improving the code here would make maintaining it easier and would also make it easier for future developers to dive into the code without much risk of breaking anything.

First steps and struggles

First of all I didn't want to continue developing on Java 8, as it reached its end-of-life now and the language itself made some nice progress in the last 6 years. So I migrated to JDK17 which meant also fixing the library situation for the JavaFX ui. Before JavaFX was bundled with the JDK, now it comes as it's own libraries. That has a drawback though: The libraries are platform specific, thus we need to build now a windows and a linux version.
Handling the platform specific libraries also made me migrate the build pipeline from Travis CI to Github actions (as almost all FAF projects are by now). Now we also have a nice workflow in the Github UI to build a release.
When trying to integrate the new version into the client I found out about the hacky way how we made the current ICE adapter working with the old Java 8 version despite not having JavaFX on board. Actually the javafx libraries of the client were passed to the ICE adapter. So I could use that too! But that needed a 3rd release without JavaFX libraries inside. This required further changes to the build pipeline (we still need dedicated win/linux versions for non-java clients!).
When testing the new ICE adapter release I was surprised as I could no longer open the debug window. But it turned out to be broken all along on previous versions. The code to inject the JavaFX libraries into the ICE adapter did not take into account, that the Java classpath separator for multiple files is different on Windows (Semicolon) and Linix (colon). So I actually fixed that, hurray!
I replaced the custom argument parsing with a well-used library called PicoCLI. This makes reading and adding new command line arguments in code much easier.

The switch to Java 17 is a potential breaking change. Thus the 3 changes above already ended in a new release that will be shipped to you probably with the next client release.

My next attempt was to remove all the static variables and encapsulate the state of the application to avoid a lot of the multi-threaded problems that potential lurk everywhere. However doing this I struggled mainly because the ICE adapter <-> JavaFX usage:

the ICE adapter has an unusual way of launching it's GUI after the application is already running
Java UI always needs to run in a separate main thread some weird
JavaFX doesn't give you a handle on the application window you launch and you can't pass it arguments

Also the UI debug logic leaks into every component and tries to grab data from everywhere. So a good and uncritical refactoring would be to rewrite the UI part...

More requirements

Slim it down

I already mentioned that we have to consider the non-Java clients. @Eternal is building a new one and there is also still the Python client. For the non-java clients the ICE adapter is a pain for packaging, because they need to ship a Java Runtime (~100mb) + an ICE adapter with UI libraries (~50mb) for a very "tiny" tool.

Eternal recently asked whether it is possible to ship a lighter Java runtime. Unfortunately the current answer is "it's to much effort". Actually the Java ecosystem has acquired features to build native executables from Java application (via the new GraalVM compiler) and this was also extended for JavaFX applications (Gluon Substrate). However with JavaFX this is very complicated and requires a lot of knowledge and experience we don't have.

It would be a more realistic goal if the ICE adapter wouldn't require a Java GUI. As someone who was recently involved with some more Web development I was thinking about shipping a browser-based GUI connected to the ICE adapter via WebSocket.

We need more data

When I was designing a websocket protocol for a ice adapter <-> communication I was struck by an interesting idea.

What if we could use this data stream to track the connectivity state and gather some telemetry? This could give us insight about which connections are used most, which regions struggle the most, or if an update made things worse.

Thus I started working on a Telemetry Service that would be capable of collecting all the data of the ice adapters.

Full game transparency

But the idea started to mutate even further. Why would you want to see only your ICE adapter debug information? Maybe you want to see where a connection issue is happening right now between other players.

Also why would I bundle the UI logic with an ice adapter release, when it could be a centrally deployed web app, that can be updated independent from ice adapter releases!

So in this scenario the ICE adapter sends all of its data to the telemetry server. Players then connect to the telemetry server ui and can see a debug view of all players connected to the game and each other.

This is what I've been working on the last 3 weeks, and it's in a state where we can replace the ui and see the whole game state for all players. But we all know: Pics or didn't happen, so here is a current screenshot of the future ui (with fake data):

A new roadmap

So here is the battle plan for the future:

Release the Java17 ICE adapter to the world.
Finish the basic telemetry server and ice adapter logic and ship it for testing (keep the old debug ui for comparison)
Persist telemetry data into some meaningful KPIs, so we can observe the impact of new ice adapter versions
Drop the old debug UI and continue refactoring the ICE adapter into a better non-static architecture
Update the core ICE libraries and see if things improve
Try building native executables for the ice adapter

Are you interested to join this quite new project? (The telemetry server is really small, written in Kotlin with Micronaut framework). This is your chance to get into FAF development on something with comprehensible complexity! Contact me!

Brutus5000

As many of you may have noticed yesterday was a bad day for FAF.

Context

In the background we are currently working on a migration of services from Docker to Kubernetes (for almost two years by now actually...). Now we were in a state where we wanted to migrate the first services (in particular: the user service (user.faforever.com) and ory hydra (hydra.faforever.com). In order to do this we needed to make the database available in Kubernetes.

This however is tricky: in Docker every service has a hostname. faf-db is the name of our database. It also has an ip-adress but that ip address is not stable. The best way to make a docker service available for Kubernetes on the same host, is to expose the database on the host network. But right now the database is only available on the host from 127.0.0.1, not from inside the Kubernetes network. This required a change to the faf-db container and would have caused a downtime. As an alternative we use a tcp proxy bound to a different port. As a result a test version of our login services were working, while the database pointed to the proxy port. Now we planned expose the actual MariaDB port with the next server restart...

Another thing to know:
We manage all our Kubernetes secrets in a cloud service called infisical. You can managed secrets for multiple environment there, and changes are directly synced to the cluster. This simplifies handling a lot.

Yesterday morning

It all started with a seemingly well-known routine called server restart.
We had planned it because the server was running multiple months without restart aka unpatched Linux kernel.

So before work I applied the change and restarted the server.

Along with the restart we applied the change as described above: we made the MariaDB database port available for the for everybody on the network and not just 127.0.0.1. It is still protected via firewall, but this changed allowed it to use it from our internal K8s.

That actually worked well... or so I thought..

More Kubernetes testing

Now with the docker change in place I wanted to test if our login services now work on Kubernetes too. Unfortunately I made two changes which had much more impact than planned.
First I updated the connection string of the login service to use the new port. Secondly I absent-minded set the endpoint of the user service to match the official one so e.g. user.faforever.com now pointed to k8s. thirdly I set the environment to K8s because this shows up in the top left of the login screen for all places except production.

Now we have two pair of components running

A docker user service talking to a docker Ory Hydra
A K8s user service talking to a K8s Ory Hydra

What I wasn't aware (this is all new to us):

If an app from Docker and an app from K8s compete for the same DNS record, the K8s app wins. So all users where pointed to the k8s user service talking to the K8s Ory Hydra.
By changing the environment, I also changed the place, where our Kubernetes app "infisical" tries to download it's secrets. So now it pointed to an environment "K8s" which didn't exist and didn't have secrets. Thus the updated connection string could not be synced with K8s, leaving Ory Hydra with a broken connection string incapable of passing through logins.

So there were two different errors stacked on top of each other. Both difficult to find.

One fuckup rarely comes alone

Unfortunately in the meantime yet ANOTHER error occured. We assume that the operating system for some reason ran out of file descriptor or something causing weird errors, we are still unsure. The effect was this:

The docker side Ory Hydra was still running as usual. For whatever reason it could no longer reach the existing database, even after a restart. We have never seen that error before, and we still don't know what caused it.
Also the IRC was suddenly affected kicking users out of the system once it reached a critical mass, leading to permanent reconnects from all connected clients leading to even more file descriptors created...

So now we had stacked 3 errors stacked on top of each other and even rolling back didn't solve the problem.

This all happened during my worktime and made it very difficult to thoroughly understand what was going on or easily fix it.

But when we finally found the errors we could at least fix the login. But the IRC error persisted, so we shut it down until the next morning when less people tried to connect.

Conclusions

The FAF client needs to stop instantly reconnect to IRC after a disconnect. There should be a waiting time with exponential backoff, to avoid overloading IRC. (It worked in the past, we didn't change it, we don't know why this is an issue now...)
The parallel usage of Docker and Kubernetes is problematic and we need to intensify our efforts to move everything.
More fuckups will happen because of 2., but we have to keep pushing.
Most important: The idea to make a change when less users are online is nice, but it conflicts with my personal time. The server was in broken state for more than half a day because I didn't have time to investigate (work, kids). The alternative is, to make these changes when I have free time: at the peak time of faf around 21.00-23.00 CET. This affects more users, but shortens troubleshooting time. What do you think? Write in the comments.

Brutus5000

Sometimes people ask us: Can't you just stop changing things and leave FAF as it is? The answer is no, if we do that FAF will eventually die. And today I'd like to explain why.

Preface

Do you still use Windows XP? According to some FAF users the best operating system ever. Or Winamp? Or Netscape Navigator? Or a device with Android 2.3?

Probably not. And with good reason: Software gets old and breaks. Maybe the installation fails, maybe it throws weird errors, maybe it's unusable now because it was build for tiny screen resolutions, maybe it depends on an internet service that no longer exists. There are all sorts of reasons that software breaks.

But what is the cause? And what does that mean for FAF?

Simplification and division of labor: About programming languages and libraries

People are lazy and want to make their lives easier. When the first computers were produced, they could only be programmed in machine language (assembler). In the 80s and 90s some very successful games like Transport Tycoon were written in assembler. This is still possible, but hardly anyone does it anymore. Effort and complexity are high and it works only on the processor whose dialect you program.

Nowadays we write and develop software in a high level language like C++, Java or Python. Some smart people then came up with the idea that it might not make much sense to program the same thing over and over again in every application: Opening files, loading data from the internet or playing music on the speakers. The idea of the library was born. In software development, a library is a collection of functions in code that any other developer can use without knowing the content in detail.

These libraries have yet another name, which sheds more light on the crux of the matter: dependencies. As soon as I as a developer use a library, my program is dependent on this library. Because without the library I cannot build and start my application. In times of the internet this is not a problem, because nothing gets lost. But the problem is a different one, we will get to that now.

The software life cycle

Even if it sounds banal, every piece of software (including the libraries mentioned) goes through a life cycle.
At the very beginning, the software is still very unstable and has few features. Often one speaks also of alpha and beta versions. This is not relevant for us, because we do not use them in FAF.

After that a software matures. More features. More people start using them. What happens? More bugs are found! Sometimes they are small, e.g. a wrong calculation, but sometimes they are big or security related problems. Those that crash your computer or allow malicious attackers to gain full access to the computer they are running on. Both on the FAF Server and on your computer at home a nightmare. So such bugs have to be fixed. And now?

Scenario A:
A new release is built. But: A new release of a dependency alone does not solve any problems. It must also be used in the applications that build on it! This means that all "upstream" projects based on it must also build a new release. And now imagine you use library X, which uses library Y, which in turn uses library Z. This may take some time. And 3 layers of libraries are still few. Many complex projects have dependencies up to 10 levels deep or more.

Scenario B:
There is no new release.

The company has discontinued the product, has another new product or is bankrupt.
The only developer has been hit by a bus or is fed up with his community and now only plays Fortnite.

Finally, all commercial software will end up in scenario B at the end of its life cycle. And in most cases open source software also builds on top commercial software directly or indirectly.

Just a few examples:

All Windows versions before Windows 10 are no longer developed. They have known security issues and you are advised to no longer use them.
The latest Direct X versions are only available on the latest Windows
Almost all Firefox versions older than 1 release are no longer supported (with a few exceptions)

What happens at the end of the lifecycle?
For a short period of time, probably nothing. But at some point shit hits the fan. Real world examples:

When people upgrade their operating system to Windows XP or newer some older Install Shield Wizards doesn't work anymore. Suddenly your precious Anno 1602 fails to install.
Your software assume the users Windows installation has a DVD codec or some ancient weird video codec to be installed, but Microsoft stopped shipping it in Windows 10 to save a few bucks.
There is an incompatibility in a version of Microsofts Visual C++ redistributable (if you ever wondered what that is, it's a C++ library for Windows available in a few hundred partially incompatible versions)

The impact on FAF

FAF has hundreds of dependencies. Some are managed by other organisations (e.g. the Spring framework for Java handles literally thousands of dependencies), but most are managed by ourselves.

A few examples that have cost us a lot of effort:

Operating system upgrades on the server
Python 2 is no longer supported, Python 3 is only supported until version 3.4 (affects Python client, lobby server, replay server)
Qt 4 was no longer supported (affected Python client), we needed to migrate to Qt5
All Java versions prior to 11 are no longer supported by Oracle (concerns API and Java client)
Windows updates affects all clients
Microsofts weird integration of OneDrive into Windows causes weird errors to pop up

Many of these changes required larger changes in the software codebases and also impacted the behavior of code. As source of new bugs to arise.

If we would freeze time and nothing would change, then all this would be no problem. But the software environment changes, whether we as developers want it to or not. You as a user install a new Windows, you download updates, you buy new computers, there is no way (and no reason) for us to prevent this.

And we must and want FAF to run on modern computers. And of course we want to make bug fixes from our dependencies available to you. So we need to adapt. FAF is alive. And life is change. But unfortunately in software change also brings new errors.

Everytime we upgrade a dependency we might introduce new bugs. And since we're not a million dollar company, we have no QA team to find this bugs before shipping.

Brutus5000

The ICE adapter disagrees.

Brutus5000

Hello everybody,

we apologize for the technical issues in the last 2 days. Nevertheless the vote ended and Morax is the official winner of the election.

We thank FtxCommando for his service in the last 2 years.

Voting details

The mode of vote was instant run off. So every round we eliminate the candidate with the least votes and transfer his votes to whatever the voter has defined as the next fallback vote.
Since we only had 3 candidates that makes it fairly easy to lookup.

Results of the 1st iteration (primary votes):

Votes	Candiate
289	Morax
201	Emperor_Penguin
175	FtXCommando
4	nobody

FtXCommando is last and his 175 votes are transferred to:

Votes	Candiate
96	Morax
46	Emperor_Penguin
33	nobody

This gives us results of the 2nd iteration:

Votes	Candiate
385	Morax
247	Emperor_Penguin

As it's only 2 candidates left, the one with the majority wins. In this case it's Morax.

Voting distribution over time

Also people wanted to know when was voted and if we can shorten future voting periods. Here is some data:

Day of Vote	# of votes	% of total votes	Accumulative %
1	283	42,2 %	42,2 %
2	79	11,8 %	54,0 %
3	31	4,6 %	58,7 %
4	10	1,5 %	60,1 %
5	10	1,5 %	61,6 %
6	14	2,1 %	63,7 %
7	8	1,2 %	64,9 %
8	7	1,0 %	66,0 %
9	4	0,6 %	66,6 %
10	8	1,2 %	67,8 %
11	6	0,9 %	68,7 %
12	5	0,7 %	69,4 %
13	4	0,6 %	70,0 %
14	5	0,7 %	70,7 %
15	15	2,2 %	73,0 %
16	7	1,0 %	74,0 %
17	6	0,9 %	74,9 %
18	75	11,2 %	86,1 %
19	23	3,4 %	89,6 %
20	7	1,0 %	90,6 %
21	9	1,3 %	91,9 %
22	10	1,5 %	93,4 %
23	11	1,6 %	95,1 %
24	10	1,5 %	96,6 %
25	5	0,7 %	97,3 %
26	1	0,1 %	97,5 %
27	7	1,0 %	98,5 %
28	3	0,4 %	99,0 %
29	4	0,6 %	99,6 %
30	3	0,4 %	100,0 %

Brutus5000

Yesterday I released the magical 1.0 release of the faf-moderator-client (mostly called "Mordor"). This is a remarkable milestone for me, as I reserved the 1.0 release for the feature complete version which I now believe to have achieved.

Thus it's time to take a break and recap the history of my first and biggest "standalone" contribution to FAF.

When I joined FAF development in 2017 (over 5 years ago) there was a lot of regular manual work in the background. Most of this revolved around uploading avatars, banning players, maps & mods. It put a lot of unnecessary work on the shoulders of the 4 admins we were back then (dukeluke, Downlord, I and in urgent cases also sheeo). Some of the stuff was wrapped in shell scripts and similar stuff to ease it out. Yet, every change required a person to login to the server and run these scripts.

Also in 2017 we introduced the re-invented API which moved from Python to Java and introduced a framework that allowed us to develop more features much much faster. Now we had an easy to read, edit or create things like bans, avatars and such. But we had no user interface for it.

So I started this as a complete new application on my own around mid 2017. Of course I still got advise from my good friend Downlord. And of course I copied a lot of code from other places such as the java client. But still this was the first application I ever wrote that was going to be beyond a certain complexity, actually useful and used by other people!

In Autumn 2017 things escalated to a dramatic level. A renegade group of players was banned for cheating and began destroying as many open lobbies and games as they could as an act of revenge by exploiting bugs. To this end, hundreds of new accounts were created daily.

Enter Mordor. Like Gandalf in Lord of the Rings told, so acted the moderator client: "A wizard is never late, nor is he early, he arrives precisely when he means to." Actually I like the quote so much that I named the first release gandalf and later on started naming major releases after movie/TV references or other funny names if nothing better came up.

With little effort, our moderators were suddenly able to block newly created accounts independently and within minutes, putting an end to the obnoxious activities of the exploiters.

Over the time more and more features and some were contributed by other developers. Still essentially it was "my" project were I was making decisions and carried the responsibility and it felt good.

So what is left after these 5 years?

Is the code high quality and following best practices?
No. It's just good enough that it barely works.
Is it well tested with automated test frameworks and setting the bar for future projects?
Totally not. That thing has exactly one test and that checks if the application can launch.
Would I do it the same way with the knowledge of today?
Not for all money in the world. Back then many people disliked the fact that I wrote a java based desktop application and recommended a web based approach and they were right back then (and even more 5 years later). But still I do believe that it was the right choice for me personally as the one who invested all that time. Simply because Java was a language I was familiar enough to get it started and enough help and references to get it working. If I had chosen to learn all the web related stuff, I probably would have been stuck at some point and lost interest. Also I wouldn't have learned all the painful issues when writing desktop applications.
So even though I usually frown upon web developing, today I understand why things are the way they are and have a much better understanding of the problems that they solve (better).
Maybe at some point in the future I will restart and do all the things in a web application. Just to get the same learning effect I had with mordor.
I truly believe that you don't become a good software engineer by reading books and articles about best practices. My experience so far tells me you have to get your hands dirty, work against some of the recommendations, make mistakes and fail so that you see why the best practices are the way they are, how they can help and when you should blatantly ignore them.
And now, for the ones who are still reading (wow, you really must be super bored!), here were the release names/themes:
- 0.1 Gandalf:
  “A wizard is never late, nor is he early, he arrives precisely when he means to.”
- 0.2 Dr. Jan Itor:
  "You will not ruin my Christmas. Not again. Not this year"
  This close-to-christmas release was a reference to my favourite TV show Scrubs.
- 0.3 Chinese Democracy
  This release allowed creating votes for FAF
- 0.4 Jeremy's farewell
  Obsoleted Softlys eye-cancer-coloured avatar management app
- 0.5 Police Academy
  The release made the content of the client tutorial tab editable
- 0.6 Paper War
  A reference to the new moderation-report feature
- 0.7 Modzarella
  "Because mods modding mods is cheesy."
  Added features for mods to manage the mod vault
- 0.8 Checks and balances
  Implementation of the permission system splitting off the power from the allmighty moderators
- 0.9 Maximum Break
  We added managing map pools. A reference to all the Snooker fans out there.
- 0.10 Secret Empire
  The only release name not chosen by me but by Sheikah instead. We migrated to our new OAuth service called Ory Hydra. A reference for all Marvel fans. I didn't get it, but I think he still nailed it.
- 1.0 Avengers: Endgame
  We hit the end of the road. Nobody is going to revive good old Tony (no, I actually didn't watch a single Marvel movie...)

Last fun fact: The 1.0 release seems to contain a critical bug, so there will be at least a 1.0.1

Brutus5000

I locked this topic (not moderators), because the user @Hashbrowns69 is a fake account that was just created today in order to perform some trolling.

However to people reading this.

I'd like to remind you that we are all working on our free time here. I have a full time job, a little son, a server to run and an association to setup.

Yes, I did not actively stop this nonsense (question is if I could), but I would like to remind you that apart from some drama nothing really happened now.

The server is running, the client receives updates, tournaments are still happening. Every organisation can have issues and yes this time it went public. That's sad but we're also all human.

FAF won't die just because there is some drama going on. There has been drama in the past, you just didn't notice.

Brutus5000

Hi everyone,

we will be moving to a bigger dedicated server in the next 2 weeks. We have not planned date yet, as there is a lot of preparation to be done. As usual we will announce outages in Discord, so you'll know what's going on.

This was planned for a longer time, but also pushed to the recent outages (even though we cannot guarantee that a faster server will resolve these issues):

The server scratches at its 64GB limit for quite some time now.
The disks are worn out (that's probably why lately one died and to our negative surprise Hetzner replaced it with an almost equally old disk....) and also full, so we're juggling around with external storage (also this is an ancient issue)
CPU peak load is reached more often lately.

The new server is a AX102 (details here). Effectively we triple the CPU power, double the RAM and double the storage compared to the current server.

Brutus5000

Today two thousand players encountered once again how fragile our whole server setup sometimes is and how an unlikely chain of events can cause major issues. In this talk I want to shed some light on how the chain started moving.

Start of the journey: The chat user list

We are currently investigating CPU load issues in the client which we could track down to updates of the chat user list in #aeolus. If you open the chat tab you see a partial list of users, as not the whole list fits to the screen.
We are using a ListView component from our UI framework for this. Each row or ListItem (a user or a group name) is one element in this list view. However, it would be very memory hungry, if the application would render each chat user's list item (~2000 on peak times) if you can only see a few dozen of them.
Therefore the list view component only holds a pool of list items that you can actually see and re-uses / redraws / repopulates them as the list changes.
When profiling the application I could observe hundreds of thousands of updates to the list over a period of 10-15 minutes with about 2.000 users logged in to the chat. Something was causing massive changes to the user list and we couldn't find a reason.

Mass changes out of nowhere

On a different topic we noticed an increasing number of errors when trying to login to the IRC on peak times for no obvious reason as well. Sometimes it took multiple minutes to login.

But then somebody using a regular IRC client looked into the logs and found an interesting thing: a lot of people were logging in and out in a very short period with the quit message "Max SendQ exceeded". A little bit of googling revealed, that this error message was thrown, when the IRC server sends data to the client, and the client doesn't load it fast enough.

Using the incredibly long chat log history of our moderators we were able to track down these errors to their first occurences to October 2019 when we roughly breached 1900 users online at once. Since then FAF has been growing (up to 2200 users online at peak) and so grew the problem as well.

So there is a coincidence between the SendQ exceeded error and the amount of relogin/logouts. Now we tried to connect the dots:

Whenever you join a channel you receive the list of users in this channel.
If this list goes beyond a certain size the error pops up.
The error caused lots of logins/logouts.
Each of these events caused the chat user list to update multiple times.

So the idea was: If we could solve the IRC error, we might solve / reduce the users CPU load. A potential solution was to increase the buffer size before the server sends the error message.

The IRC restart of Doom

The situation was perfect: 2000 users online. A configuration change. All users would automatically reconnect. We could live-test if the change would make the server go away or persist.

So we flipped the switch and restarted the IRC server.

We could observe the IRC numbers rising up from 0 to 1400 and then it drastically stopped. 600 users lost. What happened?

We expected the IRC reconnect to be simple and smooth. But when did anything ever went smooth? A look into the server console showed all cores at 100% load, most utilization by the API and our MySQL database. WTF happened?

The cache that cached too late

The FAF client has a few bad habits. One of them is, that it downloads information about all clans of all people online in the chat. This is a simplified approach to reduce some complexity.

Since we only have a few dozen active clans, that would only be a few dozen simple calls which would/should be in cache already. Or so we thought.

As it turns out, the caching didn't work as intended. The idea behind the caching is the following: You want to lookup a clan XYZ. First you look into the cache, and if you don't find it here, you ask our API. Asking the API does take a few dozen milliseconds.
But what happens, if you have 10 players from the same clan? You should look it up in the cache, and if you don't find it there, you would query it once, right? Well that's what we expected. But since the first API lookup wasn't finished when looking up the clan for 2nd to 10th time, each of these lookups would also ask the API.

Now imagine 2000 users querying the clans for every player with a clan tag (~200?). Boom. Our API and our database are not meant to handle such workloads. It didn't crash, but it slowed down. And it took so much CPU while doing this, that there wasn't much CPU left for the other services. It took the server roundabout 20 minutes to serve all requests and go back to normal.

Once we spotted that bug, it was an easy fix. As it turns out we just needed a sync=true to particular cases. Problem solved. Why oh why is this not the default? Well, as it turns out this option was added in a later version of our Spring framework and due to backwards compatibility it will always be an opt-in option. Ouchy.

Lessons learned

We saw it again: IRC sucks (well we all knew that before right?). But it seems like the IRC configuration change fixed the SendQ exceeded error. Yay, we actually might reduce the CPU usage for our players.
Also we now know that synchronized caching should be the default, at least in the FAF client
Unfortunately it was revealed again that the FAF api can't handle ultra-high workloads. Unfortunately limiting the amount of available CPUs in Docker does not work in out setup. Further solutions need to be evaluated (e.g. circuit breakers).

Recap and next steps

So now we know that restarting the IRC server can be a dangerous thing. So would I do it again? Hell yes. I know it sucks for the playerbase if the server is having massive lags due to running into CPU limit. But today we learned so much about what our software architecture is capable and where we have weaknesses.

We can use this to improve it. And we need to. FAF is growing more than I've ever expected. In the last 12 months our average-online-playerbase grew by ~300 users with the latest peak being 2200 users. When we moved to the new server beginning of the year I thought that our new server can handle around 3000 users, but as it turns out there is more to do.

Brutus5000

Many of you won't remember it, but the amount of posts in this forum and the feedback we collected showed us: the onboarding process along with the registration page is a horror for new users (especially non-english native speakers).

Together with our fellow community member Terrorblade, who is a working expert on customer journeys, we worked out weaknesses in the current process and ways to fix them.

Here are few things we are doing horribly wrong:

The website promotes the client download. But that is not the start of the journey. If you download and install the client as first step, you then figure out that you need to go back to the website to register and steam link first. In worst case you go Website landing page -> Client -> Website registration -> Client -> Website steam linking -> Client
The website is only available in english
The information is unstructured and not really offering much help where you need it
Errors in the process don't give the user any usefull help (e.g. technical api errors are printed in plain text)
Field validation is not always available (e.g. is the username already taken? You only find out after you click the registration button)
Hidden steps (you can only Steam link after you logged in!)

Due to the Steam link process and the required client download the FAF onboarding process is much more complicated than the that of other websites. We even modelled a perfect workflow, but that's beyond the scope of the current work. Nevertheless for the brave peopel here I'd like to share it with you, without further explanation:
![0_1605396838423_FAF User Journey.png](Uploading 100%)

In the last weeks I had the opportunity in my job to learn frontend development with Angular (and yet more trainings are to come). In order to train my newly gathered skills, I gave it a try. Unfortunately what I still don't have are skills for designing shit with CSS and colors.

All screenshots are representing work in progress:

Failed registration: As long as you didn't fix all issues and selected all checkboxes you can't click the button. The user lookup is done with a 1 second delay

You can dynamically switch the language. Currently I'm only populating english and german. The biggest need however would be russian.

Once you successfully registered we now (and only then) tell you, that you need to click on the email link:

Nevertheless, users are capable of switching between the steps to see whats ahead and what to wait for:

But using the activation url it looks simpler:

So after a successfull activation we can move on:

The Steam linking page needs some more info why this is required I guess:

The steam login itself is not interesting here, so I skip it. After a successfull login we get this:

And the FAF CLient Setup page is yet missing content completely. Not sure if we want to add screenshot per language, that sounds a bit too much work...

So far I have only shown the "happy path". The most important pain point for users is when the Steam linking doesn't work. We need to be able to distinguish between "You don't own the game" and "Your profile seems to be not public", but this needs some API changes as well.

I hope this change will reduce the amount of people we loose on the way to FAF. But it's still a long way and it might need much more time invested. So far I spent around 20 hours to reach this state (yet I'm facing many beginner problems as this is my first bigger Angular project, so it's a lot of basic learning time included).

Maybe I'll do some more Twitch streams and now the 5 people who saw my last (and first) stream know what they actually saw

Please share ideas and feedback guys!

Brutus5000

Around October 2023 hostile ex-community members launched the first wave of DDoS attacks. The FAF infrastructure and application landscape was not prepared for that. Services directly connecting to our main servers on many open ports, openly accessible APIs for the benefit of the community. People could run their own IRC bots. People built API scrapers and analytics tools. All of that had to vanish basically overnight.

In a rush the FAF team closed down open ends as good as possible. We migrated the lobby connection from raw TCP to web sockets, we put the formerly open API behind authentication. We changed the IRC server in favour of an implementation supporting HTTP based access. And then we hid almost all services behind Cloudflare proxy servers except for the server itself that was still reachable from the internet.

The weak point remained our ICE server aka Coturn servers. So we started paying for external TURN as a service provider and added more infrastructure around. But the feedback on connectivity was never good. Things never got back to where it was. I started digging into the ice adapter more than any before. Documented features. Tried to refactor it. Tried to rewrite it. We even tried to integrate Cloudflares new TURN service as it went live. The connectivity was horrible and the used payment according to Cloudflare would have ruined FAF financially within a month, so we had to disable it, not knowing what actually happened there.  At some point the DDoS more or less stopped and things settled a little more to normal and the topic of ICE fell into background noise.

Now for over two weeks now DDoS is back with cyber-terrorist alike demands (we either abide to the attackers terms or the DDoS continues forever). But the FAF team stance here is clear: We do not negotiate with terrorists.

Instead we continued hardening our servers. Our main server is no longer reachable directly from the internet, and has to pass multiple firewalls. Yet the bottleneck once again is the ICE connectivity.

So with the accumulated knowledge of the last years, we investigated and analysed the shit out of the ICE adapter with more tooling and (semi-)reproducible test methods. The results were not promising in multiple combinations:

Our Hetzner cloud servers have huge packet loss even outside DDoS (tested on fresh VMs) on ICE-related communication. We don’t know why, but it seems that Hetzner really doesn’t like this kind of traffic. Potential solutions: (a) Report packet loss to Hetzner in a structured way, (b) use ports that are for other traffic and thus more stable, (3) use a different provider
The „coturn“ software spew non-stop errors but was completely useless in logging why these errors occurred. So we tried out a different software called „eturnal“ (love the pun here), which gave us a better hint about problems
A Wireshark capture of a user trying to connect to Cloudflare showed us a single successful connection attempt followed by 80000 (!!) failed connection attempts in a 5 minute interval.

So apart from the Hetzner issues, we could boil it down to problems in the ice adapter. The ice adapter at its core is built around the „ice4j“ library. This is a piece of software that originally built for the Jitsi phone software (even though it was renamed a few times and by now is a commercial service). The only maintainers are Jitsi developers and as such the focus lies on the features of Jitsi. There is a component called Jitsu video bridge that is also open source. When we looked into it, we saw that Jitsi is not using TURN at all, and therefore not a big priority in ice4j. The code of ice4j has no documentation outside of regular Javadocs. And it looks like it was written in a C-programming style from the 90s (while Jitsi components are written in modern Kotlin). The worst part however is, that it is not possible to control or configure from the calling code.

This should not be a problem if the library does what it should. But as far as we can see by now, the TURN code of ice4j does not behave like it should. Whether it is violating the specifications is beyond my understanding. A single example that we could identify is that ice4j tries to establish a TURN session for all the ip addresses it could find: the external ip (that one makes sense), but also all internal network ips (nope, that does not make sense!). And this in particular is the reason why the ice adapter causes an endless log stream of errors in coturn: the attempt to establish a turn connection for a private network address is causing an authentication error… (ok - both software stacks here behave like idiots). That might also be the reason why there are so many Cloudflare login attempts? We don’t know.

What we do know is that we can rewrite the ice adapter all we want. As long as we choose Java or Kotlin we are bound to ice4j as it is the only notable library for ICE.

What we do know, is that ice is a low level protocol which is used by WebRTC. Which is used by every single browser and every single audio/video conferencing tool that runs in a browser. WebRTC is everywhere and its „data channel“ feature allows features that we wished for in the ice adapter in a long time (guaranteed and order submission of packets, keep alive functionality). So why go with ICE alone when we can have WebRTC with ice?

Now, there is a striving project for WebRTC called Pion with 14k stars on Github (ice4j: 500) and around 200 contributors (ice4j: 25) and lots and lots of example code. So where is the catch? The catch is: It is written in Go. And we have no Go developers at FAF - I never used it so far.

So what do you think: Should I stay (on ice4j) or should I Go (learning Go)?

Brutus5000

It's quite some time since my last blog post and therefore I want to give you a little update on what has happened since then.

A lot of issues cascaded and blocked each other, so in many terms it didn't really feel like much progress, because nothing got really finished. But the story behind it might be interesting for you nevertheless.

So here comes the probably longest issue and failure compilation of FAF we ever had (and it's just the stuff I remember)

Travis CI terminating services
Travis CI is a "continuous integration" platform that we use in almost every Github repository to run automated tests and to build releases.

Unfortunately it looks like Travis CI is in financial trouble and they massively reduced the capacities for open source projects. Out of a sudden almost every repository needed to wait half a day or longer for a single test run or building a release, and in many cases even those failed.

As a solution we migrated most of the core repositories to Github Actions, which is not only faster but nicer to use in many cases. But depending on the CI pipeline complexity that was an effort of multiple evenings until everything was sorted out.

There are still repositories waiting to be migrated

GoLive of the matchmaker
So shortly before Christmas we reached the final phase of the matchmaker. In the hurry a few points were overlooked which caused some troubles:

On releasing the public beta client release, we sort of derped on the version number. It doesn't follow the required version pattern. This leads to every user on this release will never ever again receive an update notification. Please update manually to the latest client release.
Replay ratings: Since the dawn of time, FAF had always just 1 rating per game, because it was either Global Rating or Ladder Rating or no rating at all. Additionally global and ladder rating summaries were stored in 2 different tables.
However having 1 table per leaderboard is not a healthy strategy if you plan to have many more of them. Long time before TMM we already moved to dynamic leaderboard based rating. Effectively this allows us to calculate more than one rating per game, but 1 rating per leaderboard per game (if we need to). Until the rollout of TMM the old tables were still populated though. This stopped with the rollout, effectively breaking the rating changes show in the replay view on the client.
Also the API endpoints for fetching the correct replays where still missing and added on short notice.
The official TMM client version also had issues on showing the right rating for the in game players. This led to maximum confusion whether the whole system worked and if ratings where actually accurately calculated. Eventually I'd say it was all correct, just the display was wrong.
The replay vault search by rating was removed due to a misunderstanding between Sheikah and me. Things got complicated here as well: now you don't have the one rating to search for, but you have to select from which leaderboard you want to look at the ratings.
Instead of looking into each leaderboard with multiple queries we tried to add the current leaderboard ratings for each player into the player object on API level.
Unfortunately this creates a suboptimal query that maximized MySQL cpu usage within minutes. This lead to a few server crashes until we figured out what had happened.

Moving to MariaDB
The MySQL issue showed us once again that we want to move to a new database version. The current MySQL 5.7 is quite old.
This discussion is not new, actually it's already almost 2 years old but got deferred because there was no real pressure (and we couldn't really agree whether to go to MySQL 8 or MariaDB 10.x which are both compatible with MySQL 5.7 but incompatible with each other).

The server downtime caused by the suboptimal query as described above made me reconsider. We had multiple issues with MySQLs shitty way of doing a specific type of join (nested loops). So we tested this with MySQL 8 and MariaDB 10.5 on a copy of the production database.

There was a simple query that took 1min 25s (utilising the whole PC). MySQL 8 took just 5 seconds less. But MariaDB obviously fixed the underlying issues and returned a result in 1 second.

So this set the new path to migrate to MariaDB. A test migration from raw MySQL 5.7 without backup and restore ended up in a corrupted data set. So we'll need to to proper backup and restore.

Unfortunately MariaDB tripped over one of the migration files when creating the current database scheme. Therefore I tried to merge all migrations into a single database schema import (which took a few days to figure out).

Once this was done I still encountered the error with a better description just to figure out that this was due to a bug in MariaDB that was marked as fixed already but no release yet! So it had to be postponed once again...

It's finally released by now, and we'll need to restart some testing. This is probably part of the next server update.

Auto-Backups killed the server
Our new server uses ZFS as file system which is very modern and powerfull. When we setup the server we thought using ZFS for the Docker (the container daemon running all apps) container volumes was a good idea. It wasn't, instead it was polluting the system with ZFS datasets for each docker container layer (there were HUNDREDS of them). Even though this caused no direct harm, it prevented us from activating auto-backups.

With the last server update we fixed this and are now back to a proper dataset list:

NAME            USED  AVAIL     REFER  MOUNTPOINT
tank            888G  34.4G     3.93G  legacy
tank/docker    32.7G  34.4G     29.1G  -
tank/fafstack   160G  34.4G      158G  legacy
tank/mongodb    353M  34.4G      277M  legacy
tank/mysql     13.5G  34.4G     12.4G  legacy
tank/replays    677G  34.4G      677G  legacy

And then we thought activating auto-backup was a good idea. But it was not. Even though using snapshots should only consume the "delta" between the current and the previous state on disk MySQL alone took ~20GB for 5 snapshots. This brought the server disk into critical area, where it started mounting parts of the file system as read only.

The solution was to remove the backup snapshots and restart the server and work on the next topic:

Replay recompression
When we rented the new server in January 2020 we were thinking about how much disk space we need and what the additional cost for more storage would be. Back then we came to the conclusion that 1 TB would suffice, and once we run into problems we'd optimize the replays stored on the disk.

The replay format dates back into ancient times, where some weird decisions were made. A replay file contains of 1 lines of JSON describing the content of the replay. After the line break there is a base64 encoded deflate-compressed.

The deflate compression algorithm is ancient and very inefficient compared to modern algorithms.
Base64 encoding is a way to force any kind of binary data into a representation of the ASCII character set. This might have made sense 2013 for download compatibility, but I highly doubt it. The massive disadvantage is, that base64 encoding brings an overhead of around 30%.

Improving this replay format is a topic again, that is well known and older than 2 years and always scheduled for "can be done when needed". Even though there were code changes lying around for the replay server and the java client.

The new plan is instead of zip + base64 we will compress the replay stream with the Zstandard algorithm developed by Facebook (the best compression algorithm around).

Actually the latest client release 1.4.1 added support for the new replay format, as we already saw the server disk going full (but we expected to have more time).

With the server crash due to disk space we reprioritized this issue. Now with the client release out we prepared a script to re-compress all existing replays to Zstandard (currently I'm doing some last backups, but I intend to start it tommorow).
With test excerpts we could see a reduction of 40% disk space. This will free up at least 200GB on the server, making room for another 2-4 years of new replays.

Current estimation say it will take 10 days if we run it without parallelization.

The OAuth cascade of Doom
In my last blog I presented a new registration page. Unfortunately when trying to finish it I hit a wall on login to our API. It took me over 2 months to figure out the whole complexity of the problem.

When we started work on the FAF API we selected OAuth2 as a standardized, mature and safe protocol for login to the web. And that is sort of still the case. Then we build a whole authorisation server into our API that takes care of the whole set of login flows. Currently we use 2 of these: If you login via the website or the forum, you get redirected to an API login page and after login redirected back. This is called the "implicit flow". The java client however uses a classic username and password approach. This is known as the "password flow".

Following standards and using well known frameworks I expected to be able to connect to the API from my registration app very easily. But it didn't work out. Today I know: This is related to the fact that the concrete OAuth2 implementation we use is customized in a non-standard way (which I never even was aware of). There is a better way of doing it now, with a new standard on top of OAuth2 called OpenID Connect.

Unfortunately our API library is deprecated and it's successor dropped support for the authorization server alltogether. Therefore a replacement was needed eventually.

I was looking into the space of cloud offerings if there are any free services that could work out for FAF. But as it turns out FAF is huuuge with it's 250.000+ playerbase. Running this as a service in the cloud would cost us hundreds, maybe thousands of Euros per month. (Currently we spend just ~80€ on the prod server total cost)

So we needed something self hosted. 2 candidates were evaluated: RedHat Keycloak and Ory Hydra.

RedHat Keycloak is the more mature product so I evaluated this first. After a week of experiments, it turned out to be completely unsuitable for FAF as it is built for enterprise where you e.g. need to give your firstname and lastname and can't select a username (or even change it).

Ory Hydra on the other hand is just one tool that does OAuth2 flows and nothing else and needs to be combined with your own software. So I wrote a new one with a new fancy login and everything (that's a whole story on its own) and after 2 months and a shitton of failures mostly related to fighting http vs https in an internal network I was capable to login to the test server.

But then I was stuck because the API needs to support the old way of login and the new way of login, as the tokens look different and the cryptographical signatures are different etc.. So I wrote a compatibility layer to support both.

But now we're stuck again. Ory Hydra takes security very seriously. And due to that they refuse to implement deprecated legac OAuth flows such as "password flow", which the FAF client uses...
So we need to implement a web-based login. But there is no library support out there because the open source world has moved away from desktop applications a long time ago.

So in theory we need to build a web-based login using a WebView and catch the responses to get the access tokens generated there. But on the other side we still need to login the classic way because the server still uses the old username/password to check the login itself... there are plans and suggested code changes to fix that as well but..

Summary: It's all incompatible with each other and a transitional approach seems almost impossible. We'd need a big-bang golive deprecating all older clients with the risk of nobody being able to login to FAF for quite some time.

This is hell and there is no easy way out, but there is no alternative in the long run. And we can't just wait for the next explosion to pressure us

Brutus5000

When I talk about FAF I usually say it has the complexity of enterprise company scale. Then people chuckle a little, but they don't really understand what that means.

I am currently evaluating moving the FAF Docker-Compose stack to Kubernetes (more on that in a future blog post), but it's really hard to understand and estimate such a project without knowing what you are actually dealing with.

So I finally took the time and made an architecture diagram showing all applications running in our FAF Docker-Compose stack and their dependencies to each other.

Drawing such a thing is almost impossible so I used graphviz instead. And this is the result:

Legend

Service Overview

Some edges are unnamed to avoid repetition:

Traefik serves all references services as reverse proxy
Prometheus scrapes referenced other services for metrics
Postal sends emails for all referenced services

The pentagram diagram is using the "circo" algorithm of graphviz. Alternatively the "dot" algorithm also give usable results (all others don't give readable results):

I won't bother explaining all of these application names. If you are interested, just google them and you will find something.

Brutus5000

The shittiest time of the year aka "winter" is finally coming to an end, and unlike the rest of nature we weren't in hibernation but instead already started early our spring cleanup

DNS cleanup & Cloudflare magic (download speed issue fix?)

For those who don't know about it, last year we formed the democratic FAForever association which took over the business from it's predecessor in the USA owned by visionik.

As one of the last steps we moved the domain and dns ownership to the association and this also allowed us to give access to the people actually working with it.

When we looked at it was a huge list of over 75 organically grown records with outdated or redundant information. After a nice cleanup of using wildcards and deleting outdated stuff we are down to 30 records.

This will hopefully speed up some development / deployment processes in the future.

During this we took care of 3 additional things:

Hopefully we fixed some issue with SPF records that caused our emails to be rejected by some mail providers or put them to spam.
We setup a Cloudflare proxy for our main website and the content server. The latter is the more important hopefully fixing a lot of previous issues with download speeds on patches and mods. Please comment your experience!
We tweaked our reverse proxy to use different mechanism for retrieving TLS certificates (but thats mostly internal stuff).

FAF emails leave the stone age

Around a month ago I resolved an issue that bugged me since I started development in FAF but never had the priority to do it. Do you remember the glorious day when you signed up for FAF? Or the last time you requested a password reset? Whenever you looked into your inbox you saw a horrible unformatted text email greeting you with the retro flair of the 90s.

When I first implemented user registration and password reset in the (back then Python in 2016!) api I took over the old server code email template with a note to myself to make it more pretty. Later on I migrated it to the current Java api (around 2 years later), wrapping the unformatted text in unformatted html once more with a reminder to make it more pretty in the future.

Later on we introduced Mautic inspired from the MailJet service. Mailjet allows emails to be predefined as templates that only need to be handed variables during sending. This effectively seperates styling and content of an email. Mautic allowed this on top of a lot of other features. But again it never was a priority and never got actually used for this purpose. Instead it got used for something else: A nicely styled automated welcome email for newjoiners.

Recently I became aware of several security and management issues with Mautic (more details here) and decided to remove it. But that would have caused a feature loss of this nicely styled welcome email. So finally emails receive some priority handling. And thanks to the MailJet company inventing a dedicated email markup language, FAF after 10 years finally received beautiful html emails. Some examples:

Account activation email

Welcome to FAF email

Brutus5000

Rolling Out - Compose to Kubernetes

Since I began working on the FAF infrastructure around 2017, I have successfully completed three full server migrations.
Along with this, we (the FAF server admins) brought many changes like NixOS as declarative operating system or ZFS as
file system. Yet, nothing compares to the change we recently introduced.

After over a year of testing and experimenting, we finally did it: transitioning the core services of FAF
from a Docker Compose stack to Kubernetes (K8s) on Thursday, May 16, 2024. It's worth noting that this was a pure
background change without any visible impact for our users.

Although this significant change brought a few hiccups, the overall transition proved smoother than expected, with
everything seeming to function as before.

Why We Transitioned to K8s: Understanding the Changes and Reasons

Previously we were running all services in a set of three Docker Compose files. Each service was running as a docker
container. Essentially, in Kubernetes everything also runs as Docker containers.

For that we had to run the docker daemon. With Kubernetes, we run k3s as daemon that takes care of everything.

We are still running a single node. We did not set up a large Kubernetes cluster. Also, we still mount volumes directly
into the ZFS filesystem of our host. So why even bother?

Well, Kubernetes was never about a single service in a container, it's about managing a large set of them.

And the FAF stack has become quite big with over 20 services to manage.

With Docker Compose we had to manage the containers itself via the shell on the system. Also configuration and secrets
in files were edited on the file system.

In Kubernetes we can manage the cluster using a web interface without needing to login to the server. The configuration
and services themselves are managed in a Git repository that gets automatically synchronized with the cluster. This
approach is called GitOps. Secrets are no longer managed in files but synchronized via a dedicated service called
Infisical.

I will explain more features in the next following topics.

The non goals and cons of Kubernetes

Kubernetes aims for many goals that we do not pursue. So even though they are often mentioned, the following bullet
points are aspects we did not aim for:

We did not aim to introduce a multi-node cluster. That would be needed to reach many of the goals below. This
does not fit the way FAF currently handles it storage. However, a single node is the first step to take if we have
needs for that in the future.
We did not aim to become more scalable. FAF does not need to scale to infinity. With the last server move we kind of
overprovisioned hoping to solve the DDoS issues, which unfortunately did not help.
We did not aim to enable "scale to zero". This only makes sense in fully managed "serverless" environments, to save
costs. This does not work on already cost-effective bare metal setups like FAF.
We did not aim for high availability. The overall dependability hinges on the availability of each required component,
and the current setup does not facilitate this. FAF has many architectural single points of failure that cannot
be solved with infrastructure.
We did not aim to move to a managed cluster like in AWS, GCP or Azure. The current setup is the most cost-efficient.

So now that we declined all the standard reasons to introduce Kubernetes, let us look at which burden we actually
brought to ourselves:

Kubernetes is much more complex than a plain Docker Compose setup.
- Kubernetes to Docker is like a Linux distribution to the Linux kernel. It is a whole suite of services running in
  the backend that could have bugs. So, the more complex, the higher the chance there is that we meet bugs.
- It introduces many new concepts and building blocks that work together from Pod (a single deployable unit
  consisting of one or more Docker containers) to Deployment (a meta-object that handles versioning, rollout and
  scaling of a pod definition), Config Maps up to 3rd party extensions in Custom Resource Definitions. Even though only use a subset of Kubernetes, overall we use a subset of around 20 resource types.
Kubernetes has a resource overhead. All the additional services mentioned above still need to run.
For configuring it correctly we rely on added new tooling such as Helm, Infisical and ArgoCD.
It is (currently) not possible to just spin up a subset of services for local testing and development.

With all that burden, does it look like an unwise decision to go to K8s? Has FAF just become a playground of resume
driven developers?
Of course, not. We had specific problems to solve and goals to achieve, all for a healthier long-term outcome.

Unveiling the Goals and Advantages of Kubernetes

We are running our Docker Compose setup for almost a decade. We optimized it extensively but encountered numerous
limitations in terms of:

accessing our server,
operating our services,
disparity between the described state in the Docker Compose git repository and the real configurations on
the server.

Thus, our immediate, primary goals were as defined:

Move from a customized, script-oriented usage of Docker Compose to a well documented, established and widely known
industry standard workflow
Manage the cluster state via GitOps (achieved via ArgoCD)
- All changes to the cluster are committed to a Git repository.
- These changes are then immediately deployed to the server.
- Make ssh-login to the server for regular duties obsolete
Empower application maintainers (without giving ssh server access!)
- offer a secure git-based workflow to update versions edit configuration themselves
- make the "real state" of configuration visible
Automatically scan for application updates (achieved via Renovate on the Git repo)
Avoid single services to overload the server by applying resource limits
Improve stability of services
- use health checks to restart broken services automatically
- use multiple replicas (where possible)
- enable configuration changes without downtime

All of these goals can be achieved by Kubernetes today.
This hopefully allows us to distribute the work among more people without handing over the golden key to all private (
GDPR-protected) data.

It also hopefully reduced time spent on the server for future updates.

Are we finished yet?

No. We still need to migrate shared base services, such as MariaDB, RabbitMQ or PostgreSQL.

Here we still need to investigate the best way.
Generally, we have the option to use operators which allow declarative configuration of services, e.g. the allowed
users with passwords.

This would be an improvement over the current script-based approach and would simplify the setup on
local machines. Also, the topic of automated testing on the test server should be investigated. The minimal docker based solution no
longer works.

As we keep working to make FAF's infrastructure better, we are sticking to our promise of doing it really well.

Thank you all for being a part of this journey.

Many thanks to @Magge and others for refining this article.

Posts

Contribution Guidelines

Introduction

Specific Guidelines

Diversity Statement

Reporting Guidelines

Endnotes

The implications of an open source project

The FAF principles

Software setup

Current Architecture

The simple single

Single point of failures everywhere

Problem analysis

Downtime history

Complexity & Personnel situation

Why hiring people is not an option

Costs

Skills and motivation

Competing with the volunteers

Who is the boss?

Nobody works 24/7

Alternative options

Why now?

Constraints

Analysis

First steps and struggles

More requirements

Slim it down

We need more data

Full game transparency

A new roadmap

Context

Yesterday morning

More Kubernetes testing

One fuckup rarely comes alone

Conclusions

Preface

Simplification and division of labor: About programming languages and libraries

The software life cycle

The impact on FAF

Voting details

Voting distribution over time

Start of the journey: The chat user list

Mass changes out of nowhere

The IRC restart of Doom

The cache that cached too late

Lessons learned

Recap and next steps

DNS cleanup & Cloudflare magic (download speed issue fix?)

FAF emails leave the stone age

Account activation email

Welcome to FAF email

Rolling Out - Compose to Kubernetes

Why We Transitioned to K8s: Understanding the Changes and Reasons

The non goals and cons of Kubernetes

Unveiling the Goals and Advantages of Kubernetes

Are we finished yet?