Brutus5000

Brutus5000 · 12 Mar 2022, 23:00

Hello everyone,

in the last week we had two major events with key persons of FAF history stepping out of the picture.

Last Sunday on the first regular general meeting of the FAF association we elected a new board. With this election Sheeo effectively steps down from running FAF business for the assocation as a president/board member and previously from the FAForever LLC after more than half a decade.

And today we finally transferred the last (but super import) asset of FAForever: the domain ownership and DNS access. As my dashboard is showing me so nicely I can tell you that Visionik took over the domain from ZePilot on 28th October 2014. For multiple years he invested a lot of time and money to keep things running while slowly stepping out of operational business.

On behalf of the FAForever association and the FAF community I'd like to say: Thank you for your engagement over such a big period of time. Without you FAForever wouldn't be what it is today.

We promise not to break it.

Brutus5000 · 7 Jul 2021, 20:42

Contribution Guidelines

Version from 18.03.2022

Introduction

These contribution guidelines apply to all spaces managed by the FAForever project, including IRC, Forum, Discord, Zulip, issue trackers, wikis, blogs, Twitter, YouTube, Twitch, and any other communication channel used by contributors.

We expect these guidelines to be honored by everyone who contributes to the FAForever community formally or informally, or claims any affiliation with the Association and especially when representing FAF, in any role.

These guidelines are not exhaustive or complete. They serve to distill our common understanding of a collaborative, shared environment and goals. We expect them to be followed in spirit as much as in the letter, so that they can enrich all of us and the technical communities in which we participate.
They may be supplemented by further rules specifying desired and undesired behaviour in certain areas.

The Rules of the FAForever community also apply to contributors.

Specific Guidelines

We strive to:

Be open. We invite anyone to participate in our community. We preferably use public methods of communication for project-related messages, unless discussing something sensitive. This applies to messages for help or project-related support, too; not only is a public support request much more likely to result in an answer to a question, it also makes sure that any inadvertent mistakes made by people answering will be more easily detected and corrected.
Be empathetic, welcoming, friendly, and patient. We work together to resolve conflict, assume good intentions, and do our best to act in an empathetic fashion. We may all experience some frustration from time to time, but we do not allow frustration to turn into a personal attack. A community where people feel uncomfortable or threatened is not a productive one. We should be respectful when dealing with other community members as well as with people outside our community.
Be collaborative. Our work will be used by other people, and in turn we will depend on the work of others. When we make something for the benefit of the project, we are willing to explain to others how it works, so that they can build on the work to make it even better. Any decision we make will affect users and colleagues, and we take those consequences seriously when making decisions.
Be inquisitive. Nobody knows everything! Asking questions early avoids many problems later, so questions are encouraged, though they may be directed to the appropriate forum. Those who are asked should be responsive and helpful, within the context of our shared goal of improving the FAForever project.
Be careful in the words that we choose. Whether we are participating as professionals or volunteers, we value professionalism in all interactions, and take responsibility for our own speech. Be kind to others. Do not insult or put down other participants. Harassment is not acceptable.
Be concise. Keep in mind that what you write once will be read by dozens of persons. Writing a short message means people can understand the conversation as efficiently as possible. When a long explanation is necessary, consider adding a summary.
Try to bring new ideas to a conversation so that each message adds something unique to the thread, keeping in mind that the rest of the thread still contains the other messages with arguments that have already been made.
Try to stay on topic, especially in discussions that are already fairly large.
Step down considerately. Members of every project come and go. When somebody leaves or disengages from the project they should tell people they are leaving and take the proper steps to ensure that others can pick up where they left off. In doing so, they should remain respectful of those who continue to participate in the project and should not misrepresent the project's goals or achievements. Likewise, community members should respect any individual's choice to leave the project.

Diversity Statement

We welcome and encourage participation by everyone. We are committed to being a community that everyone feels good about joining. Although we may not be able to satisfy everyone, we will always work to treat everyone well.

No matter how you identify yourself or how others perceive you: we welcome you. Though no list can hope to be comprehensive, we explicitly honour diversity in: age, gender identity and expression, sexual orientation, neurotype, race, religion, nationality, culture, language, socioeconomic status, profession and technical ability.

Though we welcome people fluent in all languages, all official FAForever communication is conducted in English. Translations may be provided, but in case of contradictory wording, the English version takes precedence.

Reporting Guidelines

While these guidelines should be adhered to by contributors, we recognize that sometimes people may have a bad day, or be unaware of some of the guidelines in this document. If you believe someone is violating these guidelines, you may reply to them and point them out. Such messages may be in public or in private, whatever is most appropriate. Assume good faith; it is more likely that participants are unaware of their bad behaviour than that they intentionally try to degrade the quality of the discussion. Should there be difficulties in dealing with the situation, you may report your compliance issues in confidence to either:

The president of the FAForever association: admin@faforever.org
Any other board member of the association as listed in our forum.

If the violation is in documentation or code, for example inappropriate word choice within official documentation, we ask that people report these privately to the project maintainers or to the DevOps Team Lead.

Endnotes

This statement was copied and modified from the Apache Software Foundation Code Of Conduct and it’s honoured predecessors.

Brutus5000 · 1 Apr 2025, 06:55

After nearly a decade of relentless pursuit—researching, begging, bribing, schmoozing, annoying, plotting, and exchanging more favors than we care to admit—we’ve finally done it. The impossible has become reality: We have the source code for Supreme Commander: Forged Alliance.

Not only that, but we’ve already cleaned it up and made it compatible with modern systems, running smoothly on Windows 10+ and (of course) Linux.

Starting today, the code is publicly available on GitHub:
https://github.com/FAForever/SCFA_source

Of course, as expected, this comes with a few caveats. We do not have copyright permissions for the game assets, nor do we have official authorization to use the brand. However, this breakthrough still unlocks unprecedented opportunities: we can now fix engine-level bugs, expand the game’s capabilities beyond anything we’ve ever imagined, and push FAF into a new era.

This is more than we ever hoped for. The future of FAF has never looked brighter.

Brutus5000 · 5 Sept 2023, 21:52

It's been almost a year since my last update.

Unfortunately apart from some initial progress with adding a web-based UI not much happened to the ICE adapter repo. I took multiple approaches to de-spaghetti the code, but still I couldn't even make sense of what I'm trying to change there. And everytime I tried to refactor something I ended up in modules that do not even belong there.

The lack of any kind of automated tests and the lack of the ability to run 2 ice adapters in parallel from the same IDE made it impossible for me to gain any progress.

So a month ago I started with a new approach. I tried to use the Ice4J library (the fundamental library that the ICE adapter is built around) in a standalone project and tried to figure out how to use it. Also ChatGPT was a big help, as it creates better docs than the original authors...

Starting from this and then decomposing the ice adapter classes I could iteratively figure out how the ice adapter actually works. So with each iteration I drew my learnings into a diagram until I had a good overview over how it actually works. This can be found here.

Based on these insights I started to slowly build up a brand new, cleaner implementation of the ice adapter in Kotlin. It's far from done yet, but it already has an integration test that connects 2 ice adapters locally, which proves me that this is possible.

The code was published today and can be found in this new repository.

The next step is to achieve functional parity with the java implementation. This might take a few more months though. So stay tuned for the update next year

Brutus5000 · 19 Aug 2021, 23:35

Hello there,

recently I got asked a lot why we can't solve FAFs infrastructure related problems and outages with investing more money into it. This is a fair question, but so far I backed out of investing the time to explain it.

Feel free to ask additional questions, I probably won't cover everything in the first attempt. I might update this post according to your questions.

The implications of an open source project

When Vision bought FAF from ZePilot, he released all source code under open source licenses for a very good reason: Keeping FAF open source ensures that no matter what happens, in the ultimate worst case, other people can take the source code and run FAF on their own.

The FAF principles

To keep this goal, we need to follow a few principles:

All software services used in FAF must be open source, not just the ones written for FAF, but the complementary ones as well. => Every interested person has access to all pieces.
Every developer should be capable to run FAF core on a local machine. => Every interested person can start developing on standard hardware.
Every developer should be capable of replicating the whole setup. => Should FAF ever close down someone else can run a copy of it without needing a master’s degree in FAFology or be a professional IT SysAdmin.
The use of external services should be avoided as they cost money and can go out of business. => Every interested person can run a clone of FAF on every hoster in the world.

Software setup

Since the beginning, but evermore growing, the FAF ecosystem has expanded and included many additional software pieces that interact with what I call the "core components".

As of now running a fully fledged FAF requires you to operate 30 (!!) different services. As stated earlier due to our principles every single one of these is open source too.

As you can imagine this huge amount of different services serves very different purposes and requirements. Some of them are tightly coupled with others, such as IRC is an essential part of the FAF experience, while others can mostly run standalone like our wiki. For some purposes we didn't have any choice what to use, but are happy that there is at least one software meeting our requirements. Some software services were build in a time, when distributed systems, zero-downtime maintenance and high-availability weren't goals for small non-commercial projects. As a result they barely support such "modern" approaches on running software.

Current Architecture

The simple single

FAF was started as a one-man project as a student hobby project. It's main purpose was to keep a game alive that was about to be abandoned by its publisher. At that time nobody imagined that 300k users might register there, that 10 million replays would be stored or that 2100 user login at the same time.

From the core FAF was build to run on a single machine, with all services running as a single monolithic instance. There are lots of benefits there:

From a software perspective this simplifies a few things: Network failures between services are impossible. Latency between services is not a problem. All services can access all files from other services if required. Correctly configured there is not much that can happen.
It reduces the administration effort (permissions, housekeeping, monitoring, updates) to one main and one test server.
There is only one place where backups need to be made.
It all fits into a huge docker-compose stack which also achieves principles #2 and #3.
Resource sharing:
** "Generic" services such as MySQL or RabbitMQ can be reused for different use cases with no additional setup required.
** Overall in the server: Not all apps take the same CPU usage at the same time. Having them on the same machine reduces the overall need for CPU performance.
Cost benefits are huge, as we have one machine containing lots of CPU power, RAM and disk space. In modern cloud instances you pay for each of these individually.

In this setup the only way to scale up is by using a bigger machines ("vertical scaling"). This is what we did in the past.

Single point of failures everywhere

A single point of failure is a single component that will take down or degrade the system as a whole if it fails. For FAF that means: You can't play a game.

Currently FAF is full of them:

If the main server crashes or is degraded because of an attack basically or other reasons (e.g. disk full) all services become unavailable.
If Traefik (our reverse proxy) goes gown, all web services go down.
If the MySQL database goes down, the lobby server, the user service, the api, the replay server, wordpress (and therefore the news on the website), the wiki, the irc services go down as well.
If the user service or Ory Hydra goes down, nobody can login with the client.
If the lobby server goes down, nobody can login with the client or play a game.
If the api goes down, nobody can start a new game or login into some services, also the voting goes down.
If coturn goes down, current games die and no new games can be played unless you can live with the ping caused by the roundtrip to our Australian coturn server.
If the content server goes down, the clients can't download remote configuration and the client crashes on launch. Connected players can't download patches, maps, mods or replays anymore.

At first, the list may look like an unfathomable catalogue of risks, but in practice the risk on a single server is moderate - usually.

Problem analysis

FAF has many bugs, issues and problems. Not all of them are critical, but for sure all of them are annoying.

Downtime history

We had known downtimes in the past because of:

Global DNS issues
Regular server updates
Server performance degrading by DoS
Server disk running full
MySQL and/or API overloaded because of misbehaving internal services
MySQL/API overload because of bad caching setup in api and clients
MySQL and/or API overload because too many users online

The main problem here, is to figure out what actually causes a downtime or degradation of service. It's not always obvious. Especially the last 3 items look pretty much the same from a server monitoring perspective.

The last item is the only one that can be solved by scaling to a more to a more juicy server! And even than in many cases it's not required but can be avoided by tweaking the system even more.

I hope it comes clear that just throwing money at a more juicy server can only prevent a small fraction of downtime reason from happening.

Complexity & Personnel situation

As aforementioned, FAF runs on a complex setup of 30 different services on a docker compose stack on a manually managed setup.
On top of that we also develop our own client using these services.

Let's put this into perspective:
I'm a professional software engineer. In the last 3 years I worked on a b2b product (=> required availability Mo-Fr 9 to 5, excluding public holidays) that has 8 full time devs, 1 full time linux admin / devops expert and 2 business experts and several customer support teams.

FAF on the other hand has roughly estimated 2-3x the complexity of said product. It runs 24/7 and special attention on public holidays. It uses

We have no full time developers, we have no full time server admins, we have no support team. There are 2 people with server access who do that in their free time after their regular work. There are 4-5 recurring core contributors working on the backend and a lot more rogue ones who occasionally add stuff.

(Sometimes it makes me wonder how we ever got this far.)

Why hiring people is not an option

It seems obvious that if we can't solve problems by throwing money for a better server, then we maybe should throw money at people and start hiring them.

This is a multi-dimensional problem.

Costs

I'm from Germany and therefore I know that contracting an experienced software developer as a freelancer from west-europe working full time costs between 70k-140k€ per year.
That would be 5800€-11500€ a month. Remember: Our total Patreon income as of now is roundabout 600€.

So let's get cheaper. I'm working with highly qualified romanian colleagues and they are much cheaper. You get the same there for probably 30k € per year.
Even a 50% part time developer would exceed the Patreon income twice with 1250€ per month.

Skills and motivation

Imagine you are the one and only full time paid FAF developer. You need a huge skillset to work on everything:

Linux / server administration
Docker
Java, Spring, JavaFX
Python
JavaScript
C#
Bash
SQL
Network stack (TCP/IP, ICE)
All the weird other services we use

This takes quite a while and at the beginning you won't be that productive. But eventually once you master all this or even half of it, your market value has probably doubled or tripled and now you can earn much more money.

If you just came for the money, why would you stay if you can now earn double? This is not dry theory, for freelancers this is totally normal and in the regions such as east-europe this is also much more common for regularly paid employees.

So after our developer went through the pain and learned all that hard stuff, after 2,3 years he leaves. And the cycle begins anew.

Probably no external developer will ever have the intrinsic motivation to stay at FAF since there is no perspective.

Competing with the volunteers

So assume we hired a developer and you are a senior FAF veteran.

Scenario 1)

Who's gonna teach him? You? So he get's paid and you don't? That's not fair. I leave.

This is the one I would expect from the majority of contributors (even myself). I personally find it hard to see Gyle earning 1000€ a month and myself getting nothing (to be fair I never asked to be paid or opened a Patreon so there is technically no reason to complain).

Scenario 2)

Woah I'm so overworked and their is finally someone else to do it. I can finally start playing rather doing dev work.

In this scenario while we wouldn't drive contributors away, the total work done might still remain the same or even decline.

Scenario 3)

Yeah I'm the one who got hired. Cool. But now it's my job and I don't want to lose fun. Instead of working from 20:00 to 02:00 I'll just keep it from 9 to 5.

This is what would probably happen if you hired me. You can't work on the same thing day and night. In total you might invest a little more time, but not that much. Of course you are still more concentrated if it's your main job rather then doing it after your main job.

Who is the boss?

So assume we hired a developer. Are the FAF veterans telling the developer what to do? Or is the developer guiding the team now? Everybody has different opinions, but now one dude has a significant amount of time and can push through all the changes and ignore previous "gentlemen agreements".

Is it the main task to merge the pull requests of other contributors? Should he work only on ground works?

This would be a very difficult task. Or maybe not? I don't know.

Nobody works 24/7

Even if you hire one developer, FAF still runs 24/7. So there is 16 hours where theres no developer available. A developer is on vacation and not available during public holidays.

One developer doesn't solve all problems. But he creates many new ones.

Alternative options

So if throwing money at one server doesn't work and hiring a developer doesn't work, what can we do with our money?
How about:

Buying more than one server!
Outsource the complexity!

How do big companies achieve high availability? They don't run stuff critical parts on one server, but on multiple servers instead. The idea behind this is to remove any single point of failures.

First of all Dr. Google tells you how that works:

Dr. Google: Instead of one server you have n servers. Each application should run on at least two servers simultaneously, so the service keeps running if one server or application dies.

You: But what happens if the other server or application dies too?

Dr. Google: Well in best case you have some kind of orchestrator running that makes sure, that as soon as one app or one server dies it is either restarted on the same server or started on another server.

You: But how do I know if my app or the server died?

Dr. Google: In order to achieve that all services need to offer a healthcheck endpoint. That is basically a website that the orchestrator can call on your service, to see if it is still working.

You: But now domains such as api.faforever.com need to point to two servers?

Dr. Google: Now you need to put in a loadbalancer. That will forward user requests to one of the services.

You: But wait a second. If my application can be anywhere, where does content server read the files from? Or the replay server write them to?

Dr. Google: Well, in order to make this work you no longer just store it on the disk of the server you are running on, but on a storage place in the network. It's called CEPH storage.

You: But how do I monitor all my apps and servers now?

Dr. Google: Don't worry there are industry standard tools to scrape your services and collect data.

Sounds complicated? Yes it is. But fortunately there is an industry standard called Kubernetes that orchestrates service in a cluster on top of Docker (reminder: Docker is what you use already).

You: Great. But wait a second, can Kubernetes you even run 2 parallel databases?

Dr. Google: No. That's something you need to setup and configure manually.

You: But I don't want to do that?!

Dr. Google: Don't worry. You can rent a high-available database from us.

You: Hmm not so great. How about RabbitMQ?

Dr. Google: That has a high availability mode, it's easy to setup it just halves your performance and breaks every now and then... or you just rent it from our partner int the marketplace!

You: Ooookay? Well at least we have a solution right? So I guess the replay server can run twice right?

Dr. Google: Eeerm no. Imagine your game has 4 players and 2 of them end up on the one replay server and 2 end up on the other. So why don't you store 2 replays?

You: Well that might work hmm no idea. But the lobby server. That will work right?

Dr. Google: No! The lobby server keeps all state in memory. Depending on which one you connect to, you will only see the games on that one. You need to put the state into redis. Your containers need to be stateless! And don't forget to have a high available redis!

You: Let me guess: I can find it on your marketplace?

Dr. Google: We have a fast learner here!

You: Tell me the truth. Does clustering work with FAF at all?

Dr. Google: No it does not, at least not without larger rewrites of several applications. Also many of the apps you use aren't capable of running in multiple instances so you need to replace them or enhance them. But you want to reduce downtimes right? Riiiight?

-- 4 years later --

You: Oooh I finally did that. Can we move to the cluster now?

Dr. Google: Sure! That's 270$ a month for the HA-managed MySQL database. 150$ for the HA-managed RabbitMQ, 50$ for the HA-managed redis, 40$ a month for the 1TB cloud storage and 500$ for the traffic it causes. Don't forget your Kubernetes nodes, you need roundabout 2x 8 cores for 250$, and 2* 12GB RAM for 50$, 30$ for the stackdriver logging. That's at total of 1340$ per month. Oh wait you also needed a test server...??

You: But on Hetzner I only paid 75$ per month!!

Dr. Google: Yes but it wasn't managed and HIGHLY AVAILABLE.

You: But then you run it all for me and it's always available right?

Dr. Google: Yes of course... I mean... You still need to size your cluster right, deploy the apps to kubernetes, setup the monitoring, configure the ingress routes, oh and we reserve one slot per week where we are allowed to restart the database for upd... OH GOD HE PULLED A GUN!!

Brutus5000 · 1 Oct 2022, 23:11

This article describes a project around the FAF ICE adapter. If you have no clue what it is I'd recommend to read this blog post first.

Some of you might think now: "Dude, why do you have time to blog? We have bigger problems! Fix the gmail issue right now!" Unfortunately my life is very constrained due to my very young children, so sometimes I can work on little projects but can't tackle server side things.

Why now?

After staying away from it successfully I recently started digging into the inner workings of the ICE adapter. There are plenty of reasons that led to this decision:

The ICE adapter is a critical part of the infrastructure. But only the original author knows the code base and claimed it to be in a bad shape (which after thorough analysis can agree to). Both are basically unacceptable facts for the long-running health of FAF.
We tried adding more coturn servers to improve the situation on non-Europe continents, but we're facing some issues that could be best solved in the ICE adapter itself.
The previous author of the ICE adapter has a serious lack of time to implement features.
The ICE adapter still relied on Java 8 (the last release of Java where the JavaFX ui libraries where bundled to the Java release), but all other pieces are already on Java 17 (!). Right now it only works in the Java client due to some dirty hacks.

Constraints

The ICE adapter is a very fragile piece of software as we have learnt with some attempted changes that required a rollback to previous state more than once. The problem here is that even with intensive testing in the past with 20+ users, we still encountered users with issues that never occured during testing. Every computer, every system, every user has a different setup of hardware, operating system (and patch level), configuration, permissions, anti virus, firewall setups, internet providers and so on.

Every single variable can break connectivity and we will never know why.
This led to the point that the fear of breaking something pushed us back from adding potential improvements.

Analysis

Before I started refactoring I went through the code to gain a better understanding and noticed a few points:

the release build process still relied on Travis CI which no longer works
many libraries the ice adapter is built on are outdated
we forked some libraries, made some changes and then never kept up with the upstream changes
some code areas reinvent the wheel (e.g. custom argument parsing)
ice adapter state is shared all over the application in static variables with no encapsulation
a lot threads are spawned for simple tasks
a lot of thread synchronization is going on as every thread can (and also does) modify the static (global) state

Almost none of this is related to actual ICE / networking logic. So improving the code here would make maintaining it easier and would also make it easier for future developers to dive into the code without much risk of breaking anything.

First steps and struggles

First of all I didn't want to continue developing on Java 8, as it reached its end-of-life now and the language itself made some nice progress in the last 6 years. So I migrated to JDK17 which meant also fixing the library situation for the JavaFX ui. Before JavaFX was bundled with the JDK, now it comes as it's own libraries. That has a drawback though: The libraries are platform specific, thus we need to build now a windows and a linux version.
Handling the platform specific libraries also made me migrate the build pipeline from Travis CI to Github actions (as almost all FAF projects are by now). Now we also have a nice workflow in the Github UI to build a release.
When trying to integrate the new version into the client I found out about the hacky way how we made the current ICE adapter working with the old Java 8 version despite not having JavaFX on board. Actually the javafx libraries of the client were passed to the ICE adapter. So I could use that too! But that needed a 3rd release without JavaFX libraries inside. This required further changes to the build pipeline (we still need dedicated win/linux versions for non-java clients!).
When testing the new ICE adapter release I was surprised as I could no longer open the debug window. But it turned out to be broken all along on previous versions. The code to inject the JavaFX libraries into the ICE adapter did not take into account, that the Java classpath separator for multiple files is different on Windows (Semicolon) and Linix (colon). So I actually fixed that, hurray!
I replaced the custom argument parsing with a well-used library called PicoCLI. This makes reading and adding new command line arguments in code much easier.

The switch to Java 17 is a potential breaking change. Thus the 3 changes above already ended in a new release that will be shipped to you probably with the next client release.

My next attempt was to remove all the static variables and encapsulate the state of the application to avoid a lot of the multi-threaded problems that potential lurk everywhere. However doing this I struggled mainly because the ICE adapter <-> JavaFX usage:

the ICE adapter has an unusual way of launching it's GUI after the application is already running
Java UI always needs to run in a separate main thread some weird
JavaFX doesn't give you a handle on the application window you launch and you can't pass it arguments

Also the UI debug logic leaks into every component and tries to grab data from everywhere. So a good and uncritical refactoring would be to rewrite the UI part...

More requirements

Slim it down

I already mentioned that we have to consider the non-Java clients. @Eternal is building a new one and there is also still the Python client. For the non-java clients the ICE adapter is a pain for packaging, because they need to ship a Java Runtime (~100mb) + an ICE adapter with UI libraries (~50mb) for a very "tiny" tool.

Eternal recently asked whether it is possible to ship a lighter Java runtime. Unfortunately the current answer is "it's to much effort". Actually the Java ecosystem has acquired features to build native executables from Java application (via the new GraalVM compiler) and this was also extended for JavaFX applications (Gluon Substrate). However with JavaFX this is very complicated and requires a lot of knowledge and experience we don't have.

It would be a more realistic goal if the ICE adapter wouldn't require a Java GUI. As someone who was recently involved with some more Web development I was thinking about shipping a browser-based GUI connected to the ICE adapter via WebSocket.

We need more data

When I was designing a websocket protocol for a ice adapter <-> communication I was struck by an interesting idea.

What if we could use this data stream to track the connectivity state and gather some telemetry? This could give us insight about which connections are used most, which regions struggle the most, or if an update made things worse.

Thus I started working on a Telemetry Service that would be capable of collecting all the data of the ice adapters.

Full game transparency

But the idea started to mutate even further. Why would you want to see only your ICE adapter debug information? Maybe you want to see where a connection issue is happening right now between other players.

Also why would I bundle the UI logic with an ice adapter release, when it could be a centrally deployed web app, that can be updated independent from ice adapter releases!

So in this scenario the ICE adapter sends all of its data to the telemetry server. Players then connect to the telemetry server ui and can see a debug view of all players connected to the game and each other.

This is what I've been working on the last 3 weeks, and it's in a state where we can replace the ui and see the whole game state for all players. But we all know: Pics or didn't happen, so here is a current screenshot of the future ui (with fake data):

A new roadmap

So here is the battle plan for the future:

Release the Java17 ICE adapter to the world.
Finish the basic telemetry server and ice adapter logic and ship it for testing (keep the old debug ui for comparison)
Persist telemetry data into some meaningful KPIs, so we can observe the impact of new ice adapter versions
Drop the old debug UI and continue refactoring the ICE adapter into a better non-static architecture
Update the core ICE libraries and see if things improve
Try building native executables for the ice adapter

Are you interested to join this quite new project? (The telemetry server is really small, written in Kotlin with Micronaut framework). This is your chance to get into FAF development on something with comprehensible complexity! Contact me!

Brutus5000 · 20 Sept 2023, 16:13

As many of you may have noticed yesterday was a bad day for FAF.

Context

In the background we are currently working on a migration of services from Docker to Kubernetes (for almost two years by now actually...). Now we were in a state where we wanted to migrate the first services (in particular: the user service (user.faforever.com) and ory hydra (hydra.faforever.com). In order to do this we needed to make the database available in Kubernetes.

This however is tricky: in Docker every service has a hostname. faf-db is the name of our database. It also has an ip-adress but that ip address is not stable. The best way to make a docker service available for Kubernetes on the same host, is to expose the database on the host network. But right now the database is only available on the host from 127.0.0.1, not from inside the Kubernetes network. This required a change to the faf-db container and would have caused a downtime. As an alternative we use a tcp proxy bound to a different port. As a result a test version of our login services were working, while the database pointed to the proxy port. Now we planned expose the actual MariaDB port with the next server restart...

Another thing to know:
We manage all our Kubernetes secrets in a cloud service called infisical. You can managed secrets for multiple environment there, and changes are directly synced to the cluster. This simplifies handling a lot.

Yesterday morning

It all started with a seemingly well-known routine called server restart.
We had planned it because the server was running multiple months without restart aka unpatched Linux kernel.

So before work I applied the change and restarted the server.

Along with the restart we applied the change as described above: we made the MariaDB database port available for the for everybody on the network and not just 127.0.0.1. It is still protected via firewall, but this changed allowed it to use it from our internal K8s.

That actually worked well... or so I thought..

More Kubernetes testing

Now with the docker change in place I wanted to test if our login services now work on Kubernetes too. Unfortunately I made two changes which had much more impact than planned.
First I updated the connection string of the login service to use the new port. Secondly I absent-minded set the endpoint of the user service to match the official one so e.g. user.faforever.com now pointed to k8s. thirdly I set the environment to K8s because this shows up in the top left of the login screen for all places except production.

Now we have two pair of components running

A docker user service talking to a docker Ory Hydra
A K8s user service talking to a K8s Ory Hydra

What I wasn't aware (this is all new to us):

If an app from Docker and an app from K8s compete for the same DNS record, the K8s app wins. So all users where pointed to the k8s user service talking to the K8s Ory Hydra.
By changing the environment, I also changed the place, where our Kubernetes app "infisical" tries to download it's secrets. So now it pointed to an environment "K8s" which didn't exist and didn't have secrets. Thus the updated connection string could not be synced with K8s, leaving Ory Hydra with a broken connection string incapable of passing through logins.

So there were two different errors stacked on top of each other. Both difficult to find.

One fuckup rarely comes alone

Unfortunately in the meantime yet ANOTHER error occured. We assume that the operating system for some reason ran out of file descriptor or something causing weird errors, we are still unsure. The effect was this:

The docker side Ory Hydra was still running as usual. For whatever reason it could no longer reach the existing database, even after a restart. We have never seen that error before, and we still don't know what caused it.
Also the IRC was suddenly affected kicking users out of the system once it reached a critical mass, leading to permanent reconnects from all connected clients leading to even more file descriptors created...

So now we had stacked 3 errors stacked on top of each other and even rolling back didn't solve the problem.

This all happened during my worktime and made it very difficult to thoroughly understand what was going on or easily fix it.

But when we finally found the errors we could at least fix the login. But the IRC error persisted, so we shut it down until the next morning when less people tried to connect.

Conclusions

The FAF client needs to stop instantly reconnect to IRC after a disconnect. There should be a waiting time with exponential backoff, to avoid overloading IRC. (It worked in the past, we didn't change it, we don't know why this is an issue now...)
The parallel usage of Docker and Kubernetes is problematic and we need to intensify our efforts to move everything.
More fuckups will happen because of 2., but we have to keep pushing.
Most important: The idea to make a change when less users are online is nice, but it conflicts with my personal time. The server was in broken state for more than half a day because I didn't have time to investigate (work, kids). The alternative is, to make these changes when I have free time: at the peak time of faf around 21.00-23.00 CET. This affects more users, but shortens troubleshooting time. What do you think? Write in the comments.

Brutus5000 · 21 Sept 2020, 20:30

Sometimes people ask us: Can't you just stop changing things and leave FAF as it is? The answer is no, if we do that FAF will eventually die. And today I'd like to explain why.

Preface

Do you still use Windows XP? According to some FAF users the best operating system ever. Or Winamp? Or Netscape Navigator? Or a device with Android 2.3?

Probably not. And with good reason: Software gets old and breaks. Maybe the installation fails, maybe it throws weird errors, maybe it's unusable now because it was build for tiny screen resolutions, maybe it depends on an internet service that no longer exists. There are all sorts of reasons that software breaks.

But what is the cause? And what does that mean for FAF?

Simplification and division of labor: About programming languages and libraries

People are lazy and want to make their lives easier. When the first computers were produced, they could only be programmed in machine language (assembler). In the 80s and 90s some very successful games like Transport Tycoon were written in assembler. This is still possible, but hardly anyone does it anymore. Effort and complexity are high and it works only on the processor whose dialect you program.

Nowadays we write and develop software in a high level language like C++, Java or Python. Some smart people then came up with the idea that it might not make much sense to program the same thing over and over again in every application: Opening files, loading data from the internet or playing music on the speakers. The idea of the library was born. In software development, a library is a collection of functions in code that any other developer can use without knowing the content in detail.

These libraries have yet another name, which sheds more light on the crux of the matter: dependencies. As soon as I as a developer use a library, my program is dependent on this library. Because without the library I cannot build and start my application. In times of the internet this is not a problem, because nothing gets lost. But the problem is a different one, we will get to that now.

The software life cycle

Even if it sounds banal, every piece of software (including the libraries mentioned) goes through a life cycle.
At the very beginning, the software is still very unstable and has few features. Often one speaks also of alpha and beta versions. This is not relevant for us, because we do not use them in FAF.

After that a software matures. More features. More people start using them. What happens? More bugs are found! Sometimes they are small, e.g. a wrong calculation, but sometimes they are big or security related problems. Those that crash your computer or allow malicious attackers to gain full access to the computer they are running on. Both on the FAF Server and on your computer at home a nightmare. So such bugs have to be fixed. And now?

Scenario A:
A new release is built. But: A new release of a dependency alone does not solve any problems. It must also be used in the applications that build on it! This means that all "upstream" projects based on it must also build a new release. And now imagine you use library X, which uses library Y, which in turn uses library Z. This may take some time. And 3 layers of libraries are still few. Many complex projects have dependencies up to 10 levels deep or more.

Scenario B:
There is no new release.

The company has discontinued the product, has another new product or is bankrupt.
The only developer has been hit by a bus or is fed up with his community and now only plays Fortnite.

Finally, all commercial software will end up in scenario B at the end of its life cycle. And in most cases open source software also builds on top commercial software directly or indirectly.

Just a few examples:

All Windows versions before Windows 10 are no longer developed. They have known security issues and you are advised to no longer use them.
The latest Direct X versions are only available on the latest Windows
Almost all Firefox versions older than 1 release are no longer supported (with a few exceptions)

What happens at the end of the lifecycle?
For a short period of time, probably nothing. But at some point shit hits the fan. Real world examples:

When people upgrade their operating system to Windows XP or newer some older Install Shield Wizards doesn't work anymore. Suddenly your precious Anno 1602 fails to install.
Your software assume the users Windows installation has a DVD codec or some ancient weird video codec to be installed, but Microsoft stopped shipping it in Windows 10 to save a few bucks.
There is an incompatibility in a version of Microsofts Visual C++ redistributable (if you ever wondered what that is, it's a C++ library for Windows available in a few hundred partially incompatible versions)

The impact on FAF

FAF has hundreds of dependencies. Some are managed by other organisations (e.g. the Spring framework for Java handles literally thousands of dependencies), but most are managed by ourselves.

A few examples that have cost us a lot of effort:

Operating system upgrades on the server
Python 2 is no longer supported, Python 3 is only supported until version 3.4 (affects Python client, lobby server, replay server)
Qt 4 was no longer supported (affected Python client), we needed to migrate to Qt5
All Java versions prior to 11 are no longer supported by Oracle (concerns API and Java client)
Windows updates affects all clients
Microsofts weird integration of OneDrive into Windows causes weird errors to pop up

Many of these changes required larger changes in the software codebases and also impacted the behavior of code. As source of new bugs to arise.

If we would freeze time and nothing would change, then all this would be no problem. But the software environment changes, whether we as developers want it to or not. You as a user install a new Windows, you download updates, you buy new computers, there is no way (and no reason) for us to prevent this.

And we must and want FAF to run on modern computers. And of course we want to make bug fixes from our dependencies available to you. So we need to adapt. FAF is alive. And life is change. But unfortunately in software change also brings new errors.

Everytime we upgrade a dependency we might introduce new bugs. And since we're not a million dollar company, we have no QA team to find this bugs before shipping.

Brutus5000 · 9 Apr 2021, 20:44

The ICE adapter disagrees.

Brutus5000 · 6 Jul 2021, 10:21

Hello everybody,

we apologize for the technical issues in the last 2 days. Nevertheless the vote ended and Morax is the official winner of the election.

We thank FtxCommando for his service in the last 2 years.

Voting details

The mode of vote was instant run off. So every round we eliminate the candidate with the least votes and transfer his votes to whatever the voter has defined as the next fallback vote.
Since we only had 3 candidates that makes it fairly easy to lookup.

Results of the 1st iteration (primary votes):

Votes	Candiate
289	Morax
201	Emperor_Penguin
175	FtXCommando
4	nobody

FtXCommando is last and his 175 votes are transferred to:

Votes	Candiate
96	Morax
46	Emperor_Penguin
33	nobody

This gives us results of the 2nd iteration:

Votes	Candiate
385	Morax
247	Emperor_Penguin

As it's only 2 candidates left, the one with the majority wins. In this case it's Morax.

Voting distribution over time

Also people wanted to know when was voted and if we can shorten future voting periods. Here is some data:

Day of Vote	# of votes	% of total votes	Accumulative %
1	283	42,2 %	42,2 %
2	79	11,8 %	54,0 %
3	31	4,6 %	58,7 %
4	10	1,5 %	60,1 %
5	10	1,5 %	61,6 %
6	14	2,1 %	63,7 %
7	8	1,2 %	64,9 %
8	7	1,0 %	66,0 %
9	4	0,6 %	66,6 %
10	8	1,2 %	67,8 %
11	6	0,9 %	68,7 %
12	5	0,7 %	69,4 %
13	4	0,6 %	70,0 %
14	5	0,7 %	70,7 %
15	15	2,2 %	73,0 %
16	7	1,0 %	74,0 %
17	6	0,9 %	74,9 %
18	75	11,2 %	86,1 %
19	23	3,4 %	89,6 %
20	7	1,0 %	90,6 %
21	9	1,3 %	91,9 %
22	10	1,5 %	93,4 %
23	11	1,6 %	95,1 %
24	10	1,5 %	96,6 %
25	5	0,7 %	97,3 %
26	1	0,1 %	97,5 %
27	7	1,0 %	98,5 %
28	3	0,4 %	99,0 %
29	4	0,6 %	99,6 %
30	3	0,4 %	100,0 %

Brutus5000

@BlackYps said in Another dumb idea from Dorset:

From my experience, most things stall because people have ideas, but there is nobody to implement the idea.

I don't think this is correct though. My feeling is there are lot of people who could implement any idea, but why on earth would they spent time on the ideas of other people instead of implementing their own ideas. Implementing own ideas is the reason why most people learned coding or modding on the first place. So they are not to blame for it.

Brutus5000

nope. I joined faf development 2015, gw was long gone by then. probably 2013-2014 last season

Brutus5000 · 1 Apr 2025, 06:55

After nearly a decade of relentless pursuit—researching, begging, bribing, schmoozing, annoying, plotting, and exchanging more favors than we care to admit—we’ve finally done it. The impossible has become reality: We have the source code for Supreme Commander: Forged Alliance.

Not only that, but we’ve already cleaned it up and made it compatible with modern systems, running smoothly on Windows 10+ and (of course) Linux.

Starting today, the code is publicly available on GitHub:
https://github.com/FAForever/SCFA_source

Of course, as expected, this comes with a few caveats. We do not have copyright permissions for the game assets, nor do we have official authorization to use the brand. However, this breakthrough still unlocks unprecedented opportunities: we can now fix engine-level bugs, expand the game’s capabilities beyond anything we’ve ever imagined, and push FAF into a new era.

This is more than we ever hoped for. The future of FAF has never looked brighter.

Brutus5000 · 1 Apr 2025, 06:50

A long time ago in a galaxy of endless connectivity struggles

It is a dark time for the
FAForever project. Although the attacks
on the main server have been circumvented,
DDoS troops have driven the
DevOps forces from their hidden
mains server and pursued them across
the coturn galaxy.

Evading the dreaded bot network,
a group of developers led by Brutus5000
have established a new secret
project aiming to the remote ice world
of Cloudflare.

Entering the stage: The FAF pioneer.

The FAF pioneer is a completely new approach on the ice adapter.
We ditched our existing java implementation based on the ice4j library as we found several bugs we were unable to resolve. Instead we jumped on the Golang train and use the pion WebRTC library which seems in a much better state.

Generally speaking, WebRTC is a set of protocols most audio and video conferencing work with. The existing ICE protocol we were using is also used by WebRTC, but WebRTC goes beyond that and tries to solve other problems. Relevent for us in particular are reconnect features, packet completeness guarantees and stable packet order.

Our project reached a huge milestone 2 days ago with a first MVP release that will be the basis for future testing. Unfortunately testing is only possible with special FAF clients, so it is not out in the open for testing. Since it is incompatible with the existing ice adapter, there will be a need for the big bang migration.

The biggest goal is eventually to be able to use Cloudflare as a low-latency and low-cust TURN server battery. Unfortunately the current ice adapter bugs prevent us from using Cloudflare entirely.

If you are interested, check out the code at https://github.com/FAForever/faf-pioneer

Brutus5000 · 16 Mar 2025, 17:51

The API can return the list, the client just needs to show it.

Brutus5000 · 15 Mar 2025, 16:49

There is no way to do this now in the client. There is a pending feature request https://github.com/FAForever/downlords-faf-client/issues/3168

Brutus5000 · 9 Mar 2025, 09:08

I can verify that the person responsible behind the DDoS claimed to target individual player ips to.

We are working on measures to mitigate this issue, but it will take time.

Until then tell your ISP that they need to protect you from DDoS not the other way round. It's 2025 and every bad actor can pull this off as soon as they get your ip...

Brutus5000 · 8 Mar 2025, 14:10

And FAF will continue providing revenue for banned people buying the game again yD

Brutus5000 · 1 Mar 2025, 18:43

Make sure after uninstall to delete the remaining files from the program files folder. This is a common problem for whatever reason the installer does not always clean up properly.

Brutus5000 · 28 Feb 2025, 21:50

Around October 2023 hostile ex-community members launched the first wave of DDoS attacks. The FAF infrastructure and application landscape was not prepared for that. Services directly connecting to our main servers on many open ports, openly accessible APIs for the benefit of the community. People could run their own IRC bots. People built API scrapers and analytics tools. All of that had to vanish basically overnight.

In a rush the FAF team closed down open ends as good as possible. We migrated the lobby connection from raw TCP to web sockets, we put the formerly open API behind authentication. We changed the IRC server in favour of an implementation supporting HTTP based access. And then we hid almost all services behind Cloudflare proxy servers except for the server itself that was still reachable from the internet.

The weak point remained our ICE server aka Coturn servers. So we started paying for external TURN as a service provider and added more infrastructure around. But the feedback on connectivity was never good. Things never got back to where it was. I started digging into the ice adapter more than any before. Documented features. Tried to refactor it. Tried to rewrite it. We even tried to integrate Cloudflares new TURN service as it went live. The connectivity was horrible and the used payment according to Cloudflare would have ruined FAF financially within a month, so we had to disable it, not knowing what actually happened there.  At some point the DDoS more or less stopped and things settled a little more to normal and the topic of ICE fell into background noise.

Now for over two weeks now DDoS is back with cyber-terrorist alike demands (we either abide to the attackers terms or the DDoS continues forever). But the FAF team stance here is clear: We do not negotiate with terrorists.

Instead we continued hardening our servers. Our main server is no longer reachable directly from the internet, and has to pass multiple firewalls. Yet the bottleneck once again is the ICE connectivity.

So with the accumulated knowledge of the last years, we investigated and analysed the shit out of the ICE adapter with more tooling and (semi-)reproducible test methods. The results were not promising in multiple combinations:

Our Hetzner cloud servers have huge packet loss even outside DDoS (tested on fresh VMs) on ICE-related communication. We don’t know why, but it seems that Hetzner really doesn’t like this kind of traffic. Potential solutions: (a) Report packet loss to Hetzner in a structured way, (b) use ports that are for other traffic and thus more stable, (3) use a different provider
The „coturn“ software spew non-stop errors but was completely useless in logging why these errors occurred. So we tried out a different software called „eturnal“ (love the pun here), which gave us a better hint about problems
A Wireshark capture of a user trying to connect to Cloudflare showed us a single successful connection attempt followed by 80000 (!!) failed connection attempts in a 5 minute interval.

So apart from the Hetzner issues, we could boil it down to problems in the ice adapter. The ice adapter at its core is built around the „ice4j“ library. This is a piece of software that originally built for the Jitsi phone software (even though it was renamed a few times and by now is a commercial service). The only maintainers are Jitsi developers and as such the focus lies on the features of Jitsi. There is a component called Jitsu video bridge that is also open source. When we looked into it, we saw that Jitsi is not using TURN at all, and therefore not a big priority in ice4j. The code of ice4j has no documentation outside of regular Javadocs. And it looks like it was written in a C-programming style from the 90s (while Jitsi components are written in modern Kotlin). The worst part however is, that it is not possible to control or configure from the calling code.

This should not be a problem if the library does what it should. But as far as we can see by now, the TURN code of ice4j does not behave like it should. Whether it is violating the specifications is beyond my understanding. A single example that we could identify is that ice4j tries to establish a TURN session for all the ip addresses it could find: the external ip (that one makes sense), but also all internal network ips (nope, that does not make sense!). And this in particular is the reason why the ice adapter causes an endless log stream of errors in coturn: the attempt to establish a turn connection for a private network address is causing an authentication error… (ok - both software stacks here behave like idiots). That might also be the reason why there are so many Cloudflare login attempts? We don’t know.

What we do know is that we can rewrite the ice adapter all we want. As long as we choose Java or Kotlin we are bound to ice4j as it is the only notable library for ICE.

What we do know, is that ice is a low level protocol which is used by WebRTC. Which is used by every single browser and every single audio/video conferencing tool that runs in a browser. WebRTC is everywhere and its „data channel“ feature allows features that we wished for in the ice adapter in a long time (guaranteed and order submission of packets, keep alive functionality). So why go with ICE alone when we can have WebRTC with ice?

Now, there is a striving project for WebRTC called Pion with 14k stars on Github (ice4j: 500) and around 200 contributors (ice4j: 25) and lots and lots of example code. So where is the catch? The catch is: It is written in Go. And we have no Go developers at FAF - I never used it so far.

So what do you think: Should I stay (on ice4j) or should I Go (learning Go)?