FAF DevOps Councilor since 2018
FAF developer since 2016
FAF member since 2013
Sometimes people ask us: Can't you just stop changing things and leave FAF as it is? The answer is no: if we do that, FAF will eventually die. And today I'd like to explain why.
Do you still use Windows XP? According to some FAF users the best operating system ever. Or Winamp? Or Netscape Navigator? Or a device with Android 2.3?
Probably not. And with good reason: Software gets old and breaks. Maybe the installation fails, maybe it throws weird errors, maybe it's unusable now because it was built for tiny screen resolutions, maybe it depends on an internet service that no longer exists. There are all sorts of reasons why software breaks.
But what is the cause? And what does that mean for FAF?
People are lazy and want to make their lives easier. When the first computers were produced, they could only be programmed in machine language (assembler). In the 80s and 90s some very successful games like Transport Tycoon were written in assembler. This is still possible, but hardly anyone does it anymore. Effort and complexity are high, and the result only works on the processor whose dialect you programmed for.
Nowadays we write and develop software in a high level language like C++, Java or Python. Some smart people then came up with the idea that it might not make much sense to program the same thing over and over again in every application: Opening files, loading data from the internet or playing music on the speakers. The idea of the library was born. In software development, a library is a collection of functions in code that any other developer can use without knowing the content in detail.
These libraries have yet another name, which sheds more light on the crux of the matter: dependencies. As soon as I as a developer use a library, my program is dependent on this library. Because without the library I cannot build and start my application. In times of the internet this is not a problem, because nothing gets lost. But the problem is a different one, we will get to that now.
Even if it sounds banal, every piece of software (including the libraries mentioned) goes through a life cycle.
At the very beginning, the software is still very unstable and has few features. These are often called alpha and beta versions. This is not relevant for us, because we do not use them in FAF.
After that the software matures. More features. More people start using it. What happens? More bugs are found! Sometimes they are small, e.g. a wrong calculation, but sometimes they are big or security-related problems: the kind that crash your computer or allow malicious attackers to gain full access to the computer it is running on. That is a nightmare both on the FAF server and on your computer at home. So such bugs have to be fixed. And now?
A new release is built. But: A new release of a dependency alone does not solve any problems. It must also be used in the applications that build on it! This means that all "downstream" projects based on it must also build a new release. Now imagine you use library X, which uses library Y, which in turn uses library Z: a fix in Z only reaches you after Y, then X, and finally your own application have each shipped a new release. This may take some time. And 3 layers of libraries are still few. Many complex projects have dependencies up to 10 levels deep or more.
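To make the chain tangible, here is a minimal Python sketch. The library names X, Y and Z come from the example above; the `app` entry and the graph itself are made up for illustration and have nothing to do with FAF's real dependency tree.

```python
from functools import lru_cache
from typing import List

# Hypothetical dependency graph: each key depends on the listed libraries.
DEPENDENCIES = {
    "app": ["X"],
    "X": ["Y"],
    "Y": ["Z"],
    "Z": [],
}


@lru_cache(maxsize=None)
def depth(name: str) -> int:
    """Return the number of dependency layers below `name` (assumes no cycles)."""
    children = DEPENDENCIES.get(name, [])
    if not children:
        return 0
    return 1 + max(depth(child) for child in children)


def release_chain(name: str) -> List[str]:
    """List who has to cut a new release, in order, after `name` was fixed."""
    chain = [name]
    current = name
    while True:
        dependents = [lib for lib, deps in DEPENDENCIES.items() if current in deps]
        if not dependents:
            return chain
        current = dependents[0]  # the sample graph is a simple chain
        chain.append(current)


print(depth("app"))        # layers of libraries below the application
print(release_chain("Z"))  # everyone who must release before the fix arrives
```

In a real project the graph is a tree with many branches, so a single fix deep down can force a whole cascade of releases.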
There is no new release.
Finally, all commercial software will end up in scenario B at the end of its life cycle. And in most cases open source software also builds on top of commercial software directly or indirectly.
Just a few examples:
What happens at the end of the lifecycle?
For a short period of time, probably nothing. But at some point shit hits the fan. Real world examples:
FAF has hundreds of dependencies. Some are managed by other organisations (e.g. the Spring framework for Java handles literally thousands of dependencies), but most are managed by ourselves.
A few examples that have cost us a lot of effort:
Many of these changes required larger changes in the software codebases and also impacted the behavior of the code. That is a source of new bugs.
If we could freeze time and nothing changed, then all this would be no problem. But the software environment changes, whether we as developers want it to or not. You as a user install a new Windows, you download updates, you buy new computers. There is no way (and no reason) for us to prevent this.
And we must and want FAF to run on modern computers. And of course we want to make bug fixes from our dependencies available to you. So we need to adapt. FAF is alive. And life is change. But unfortunately in software change also brings new errors.
Every time we upgrade a dependency we might introduce new bugs. And since we're not a million dollar company, we have no QA team to find these bugs before shipping.
Many of you won't remember it, but the number of posts in this forum and the feedback we collected showed us: the onboarding process along with the registration page is a horror for new users (especially non-native English speakers).
Together with our fellow community member Terrorblade, who is a working expert on customer journeys, we worked out weaknesses in the current process and ways to fix them.
Here are a few things we are doing horribly wrong:
Due to the Steam link process and the required client download the FAF onboarding process is much more complicated than that of other websites. We even modelled a perfect workflow, but that's beyond the scope of the current work. Nevertheless, for the brave people here, I'd like to share it without further explanation:
[Image: FAF User Journey]
In the last weeks I had the opportunity in my job to learn frontend development with Angular (and yet more trainings are to come). In order to train my newly gathered skills, I gave it a try. Unfortunately what I still don't have are skills for designing shit with CSS and colors.
All screenshots are representing work in progress:
Failed registration: Until you have fixed all issues and ticked all checkboxes you can't click the button. The user lookup is done with a 1 second delay.
You can dynamically switch the language. Currently I'm only populating English and German. The biggest need however would be Russian.
Once you have successfully registered we now (and only then) tell you that you need to click on the email link:
Nevertheless, users can switch between the steps to see what's ahead and what to wait for:
But using the activation url it looks simpler:
So after a successful activation we can move on:
The Steam linking page needs some more info why this is required I guess:
The Steam login itself is not interesting here, so I skip it. After a successful login we get this:
And the FAF Client Setup page is still missing its content completely. Not sure if we want to add screenshots per language, that sounds like a bit too much work...
So far I have only shown the "happy path". The most important pain point for users is when the Steam linking doesn't work. We need to be able to distinguish between "You don't own the game" and "Your profile seems to be not public", but this needs some API changes as well.
I hope this change will reduce the number of people we lose on the way to FAF. But it's still a long way and it might need much more time invested. So far I spent around 20 hours to reach this state (yet I'm facing many beginner problems as this is my first bigger Angular project, so a lot of basic learning time is included).
Maybe I'll do some more Twitch streams, and now the 5 people who saw my last (and first) stream know what they actually saw.
Please share ideas and feedback guys!
Today two thousand players encountered once again how fragile our whole server setup sometimes is and how an unlikely chain of events can cause major issues. In this talk I want to shed some light on how the chain started moving.
We are currently investigating CPU load issues in the client which we could track down to updates of the chat user list in #aeolus. If you open the chat tab you see a partial list of users, as the whole list does not fit on the screen.
We are using a ListView component from our UI framework for this. Each row or ListItem (a user or a group name) is one element in this list view. However, it would be very memory hungry if the application rendered a list item for every chat user (~2000 at peak times) when you can only see a few dozen of them.
Therefore the list view component only holds a pool of list items that you can actually see and re-uses / redraws / repopulates them as the list changes.
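The pooling idea can be sketched in a few lines. This is a toy model in Python, not the actual ListView from our UI framework; the class name `RecyclingListView` and its methods are invented for illustration.

```python
class RecyclingListView:
    """Toy model of a list view that only keeps cells for the visible rows."""

    def __init__(self, items, visible_rows):
        self.items = items
        # The pool is created once and never grows with len(items).
        self.pool = [{"text": None} for _ in range(visible_rows)]
        self.scroll_to(0)

    def scroll_to(self, first_index):
        """Repopulate the pooled cells with the rows that are now in view."""
        self.first_index = first_index
        for offset, cell in enumerate(self.pool):
            index = first_index + offset
            cell["text"] = self.items[index] if index < len(self.items) else None

    def visible_texts(self):
        return [cell["text"] for cell in self.pool if cell["text"] is not None]


users = [f"user{i}" for i in range(2000)]
view = RecyclingListView(users, visible_rows=30)
cells_before = [id(cell) for cell in view.pool]
view.scroll_to(100)  # same 30 cell objects, new content
```

The flip side of this design is that every change to the underlying list can trigger a repopulation of the visible cells, which is cheap once but adds up if the list changes constantly.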
When profiling the application I could observe hundreds of thousands of updates to the list over a period of 10-15 minutes with about 2,000 users logged in to the chat. Something was causing massive changes to the user list and we couldn't find a reason.
On a different topic we noticed an increasing number of errors when trying to log in to the IRC at peak times, also for no obvious reason. Sometimes it took multiple minutes to log in.
But then somebody using a regular IRC client looked into the logs and found an interesting thing: a lot of people were logging in and out in a very short period with the quit message "Max SendQ exceeded". A little bit of googling revealed that this error message is thrown when the IRC server sends data to the client and the client doesn't read it fast enough.
Using the incredibly long chat log history of our moderators we were able to track these errors back to their first occurrences in October 2019, when we roughly breached 1900 users online at once. Since then FAF has been growing (up to 2200 users online at peak) and so has the problem.
So there is a correlation between the SendQ exceeded error and the number of logins/logouts. Now we tried to connect the dots:
So the idea was: If we could solve the IRC error, we might solve / reduce the users CPU load. A potential solution was to increase the buffer size before the server sends the error message.
The situation was perfect: 2000 users online. A configuration change. All users would automatically reconnect. We could live-test whether the change would make the error go away or persist.
So we flipped the switch and restarted the IRC server.
We could observe the IRC numbers rising up from 0 to 1400 and then it drastically stopped. 600 users lost. What happened?
We expected the IRC reconnect to be simple and smooth. But when did anything ever go smoothly? A look into the server console showed all cores at 100% load, most utilization by the API and our MySQL database. WTF happened?
The FAF client has a few bad habits. One of them is, that it downloads information about all clans of all people online in the chat. This is a simplified approach to reduce some complexity.
Since we only have a few dozen active clans, that would only be a few dozen simple calls which would/should be in cache already. Or so we thought.
As it turns out, the caching didn't work as intended. The idea behind the caching is the following: You want to look up a clan XYZ. First you look into the cache, and if you don't find it there, you ask our API. Asking the API takes a few dozen milliseconds.
But what happens if you have 10 players from the same clan? You should look it up in the cache, and if you don't find it there, you would query it once, right? Well, that's what we expected. But since the first API lookup wasn't finished when looking up the clan for the 2nd to 10th time, each of these lookups would also ask the API.
Now imagine 2000 users querying the clans for every player with a clan tag (~200?). Boom. Our API and our database are not meant to handle such workloads. It didn't crash, but it slowed down. And it took so much CPU while doing this that there wasn't much CPU left for the other services. It took the server about 20 minutes to serve all requests and go back to normal.
Once we spotted that bug, it was an easy fix. As it turns out we just needed to add sync=true in particular cases. Problem solved. Why oh why is this not the default? Well, as it turns out this option was added in a later version of our Spring framework and due to backwards compatibility it will always be opt-in. Ouchy.
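For those who want to see the difference, here is a minimal sketch of a cache where concurrent misses for the same key trigger only one load. It is written in Python rather than the client's Java, and it is not Spring's implementation; `SyncCache` and `slow_clan_lookup` are invented names, and the 50 ms sleep just stands in for the API call.

```python
import threading
import time


class SyncCache:
    """Cache where concurrent misses for the same key trigger only one load."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()

    def get(self, key):
        if key in self._values:          # fast path: already cached
            return self._values[key]
        with self._guard:                # hand out one lock object per key
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        with key_lock:                   # later callers wait here ...
            if key not in self._values:  # ... and re-check after waiting
                self._values[key] = self._loader(key)
            return self._values[key]


api_calls = []


def slow_clan_lookup(tag):
    """Stand-in for the API call that takes a few dozen milliseconds."""
    api_calls.append(tag)
    time.sleep(0.05)
    return {"tag": tag}


cache = SyncCache(slow_clan_lookup)
threads = [threading.Thread(target=cache.get, args=("XYZ",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(api_calls))  # the loader ran once for ten concurrent lookups
```

Without the per-key lock and the second check, all ten threads would miss the cache at the same time and each would call the loader, which is exactly the behavior described above.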
We saw it again: IRC sucks (well we all knew that before right?). But it seems like the IRC configuration change fixed the SendQ exceeded error. Yay, we actually might reduce the CPU usage for our players.
Also we now know that synchronized caching should be the default, at least in the FAF client.
Unfortunately it was revealed again that the FAF API can't handle ultra-high workloads. Unfortunately limiting the number of available CPUs in Docker does not work in our setup. Further solutions need to be evaluated (e.g. circuit breakers).
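In case the term is new to you: a circuit breaker stops sending requests to a struggling backend for a while instead of piling more load on top. The following Python sketch only illustrates the general pattern; it is not something we have built or decided on, and the class name, thresholds and `clock` parameter are all assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected without reaching the backend."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("backend is overloaded, try again later")
            self._opened_at = None  # cool-down is over: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # trip the breaker
            raise
        self._failures = 0  # any success closes the circuit again
        return result
```

A caller would wrap each API request in `breaker.call(...)` and show a "try again later" message when `CircuitOpenError` is raised, so the server gets time to recover.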
So now we know that restarting the IRC server can be a dangerous thing. So would I do it again? Hell yes. I know it sucks for the playerbase if the server is having massive lags due to running into the CPU limit. But today we learned so much about what our software architecture is capable of and where we have weaknesses.
We can use this to improve it. And we need to. FAF is growing more than I've ever expected. In the last 12 months our average-online-playerbase grew by ~300 users with the latest peak being 2200 users. When we moved to the new server at the beginning of the year I thought that our new server could handle around 3000 users, but as it turns out there is more to do.
You want to contribute to FAF but you're unsure? You think you lack the skills? Well then read on, today's blog is for you! Even though it's a story about developers, it applies to all contributors! (Also watch my YouTube tutorial on how to start contributing).
I want to tell you 2 independent stories about 2 very different developers:
I once met a Java developer in his late twenties at a party. He had been working in a software consultancy for a few years. From a trustworthy source I knew he was quite talented and that he was very unhappy about his salary. Back then my company was looking for developers, so I approached him and asked him about his skills. He was doing mostly Java frontend (ugh, not so modern) and regular backend stuff. I told him that my company was hiring. Of course he was interested, so I handed in his application.
I met him again before the interview and talked to him. I said "Dude, I saw in your application that you have no experience in software development in the cloud with Kubernetes and Docker and all the fancy shit. I have this open source project called FAForever. I could show it to you and explain a lot of stuff to you before the interview." He declined and I asked why. "Well, I spend so much time at work, I don't want to drag it into my own personal life, but of course I will say I'm eager to learn it on the job." He was not hired.
Why am I sharing this story? First of all, I'm not criticizing here. It's an absolutely fair point to strictly separate your personal life from your work life. I'm sharing this story because it shows that not every developer has the same mindset about learning in their free time. It's also obvious that it's not the skillset alone that makes a developer suitable for FAF.
Let's take another example. A few years ago there was a dude lurking in the FAF Slack offering help. He had known FAF for quite some time as a player, but had only recently stumbled over the open source project. He had a vision of bringing back Galactic War, but he had only basic programming experience. However, he was experienced in database queries, as he was working as a consultant for ERP systems. He offered some help on administrative tasks on the server, e.g. running manual queries (a lot of work was done directly in the database back then), helping on patch days or keeping an eye on the less important software like the website.
Over time he gained more and more responsibilities and even started doing more and more development. But man, did he fuck up things. One day he wanted to back up the website before an upgrade and gzipped all files, not knowing that the gzip command replaces each file with its compressed version, which resulted in a few hours of downtime for the website. Or when he tried to fix some server permissions and locked out the MySQL database from its own files. But he learned. And read docs. And improved stuff. Step by step. He didn't do it all on his own of course. He was mentored by one of the reigning DevOps councilors back then.
After hundreds of hours spent on writing code and scripts, analyzing bugs, dockerizing apps, migrating servers, building/replacing/configuring software he became proficient enough in software development that he applied for a software developer job. Even though he had no commercial track record, his knowledge of software development and the cloud related technologies (e.g. Docker) was good enough that he got the job anyway. In the same year, the dude became DevOps councilor. Some of you might have figured it out already: That guy is me. Hello!
So why am I sharing "my FAF life story" with you? Because I wanna show off? Well... maybe a little bit. But moooore important: in my opinion there are some key learnings that I want to share with you:
I hope you didn't get the impression that you need to spend hundreds of hours to get into FAF. You don't need to, if you want to contribute in one area. And nobody expects you to.
The most important thing is that you have fun. I have seen many people leave because they lost the fun by overdoing it. For me it was worth it and it was fun to spend that amount of time. Since I have a kid I have to make more and more cuts in my FAF time. That's also one of the reasons why I am trying to motivate you to become the next generation of FAF developers. The day will come when I want to pass the baton.
Some people claim that FAF is just a playground for developers, so that they have some cool technology to play with for their resume, while breaking important features for the playerbase. While I can't deny that we use our results for resumes (well, I wouldn't have gotten my job without it), I can assure you that we do not play around in terms of "resume driven development".
Just for your post I consider bringing back the feature to downvote posts.
ICE is not a failure. If you had a good connection before, then you weren't part of the 1/3 of the FAF community that didn't.
And reconnecting to games after internet loss is a savior.
This forum is the replacement for the old phpBB one. You probably have a lot of questions, so we try to answer some of them already:
We now feature a single sign-on login bound to your FAF account. You do not need to create a forum account, you just log in with your FAF username and password. If you change your FAF account name it will also change in this forum.
This allows us to always be up to date with player names. But even more important, it is the ultimate spam prevention: a spam bot can't register in the forum without registering at FAF first. This prevents all automated spam attacks against the underlying software NodeBB.
It has a drawback though: If you can't create or access your FAF account you can't ask for help in the forum. We kindly advise everybody to move such requests to the #technical-help channel on Discord.
We dropped a lot of sub-forums that weren't really used before. We might reconsider changing more stuff based on your feedback.
We are now running on a software called NodeBB. NodeBB aims to also look nice on mobile devices. Our NodeBB uses MongoDB under the hood, which is a much more suitable datastore for a forum (=> better performance).
phpBB was not fine! Let me tell you why:
This is a very good question. We will not migrate it, since this would not work out with the new forum structure and the new account linking (in the old forum we have no verified connection between a forum account and a faf account).
We want to make the forum read-only for now and maybe somebody in this community (wink, wink!) wants to write a script to turn the old forum into a static page archive.
... is debatable for sure. It took me 1 1/2 years from the first evaluation to the working solution as of now. Design was no priority topic at this point. But NodeBB offers a lot. If you are interested in helping out, please contact me.
The obvious choice would have been migrating to a newer version of phpBB. But as I mentioned before there are some issues that would have made this transition a walk on a tight rope.
Our phpBB installation is 9 years old, the underlying phpBB 3.0 release even older (according to Wikipedia work on phpBB 3.0.x began in late 2002, first release candidate was in May 2007). The web software development world was a completely different one back then than it is today:
phpBB 3 had no concept of installable mods or plugins. If you wanted to extend your installation you would follow the extension developers guide and manually edit php code files and manually run SQL statements against the database, thus adding fields, tables and whatever the developers came up with. Also version control on the running server was not a thing back then.
Unfortunately our installation made use of this. For example the whole like-system is deeply intrusive into every part of the application. The "solved" feature in the help forum is another example. Eventually no active server admin knows what has been changed to the system and there is no way to look it up.
All of these additional features would instantly break on a new version, as there is no corresponding code handling this in a newer version. Furthermore every change to the database could make a potential upgrade fail. And there is nobody out there to help you upgrade a manually fucked up 3.0 installation throughout version 3.1 and then 3.2. And if you then run into some weird unexplainable issues due to some weird migration leftovers nobody will be able to help you.
In addition to that, past experience shows that former administrators seem to have manually fiddled with the database, as some features in the database are broken or don't work as intended.
An upgrade would have come with high risk, unknown outcome and lots of effort. This is the reason why we decided to look for a solution with a clean installation. Furthermore phpBB 3.2 didn't even match our criteria.
Before selecting any kind of software (or product) you should know what you want. Knowing the issues of the old forums and from other issues we had in FAF over the last 5 years, I made a list of requirements before I searched the market.
Eventually NodeBB was the only product I found matching all criteria, so I'll leave out the comparison with other software out there.
Eventually the points 6 and 7 were the dealbreakers on most forum alternatives out there.
Guys, stay cool.
ThomasHiatt brings some valid points. To dismiss them with the argument "all work is voluntary" does not do justice to the matter.
Nevertheless, I would like to start by emphasizing that we have achieved positive results in recent years. The ICE adapter works and has, in my opinion, brought about considerable improvements. The new replay server no longer crashes once a week and generates thousands of broken replays. We have moved the server several times when the old one crashed under load almost every day. All these changes did not go smoothly. Nevertheless, they have improved FAF for everyone, including ThomasHiatt.
None of these improvements would have been possible without the tireless efforts of our volunteers. And none of these projects would ever have gone live if mistakes were not tolerated.
We all try to meet the wishes of the community. These can be bugfixes, small feature requests, but also big tickets like Team Matchmaker or Galactic War.
The basic idea we follow is that when we touch code for a bug or feature, the new code should align with our big tickets. The biggest change at the moment is the Team Matchmaker and it has a big impact on our rating handling.
Now, unfortunately, FAF has another complexity in the interaction of Server<->Client<->Game. People working on the lobby server usually have little knowledge of the game code and vice versa. Unfortunately there are often misunderstandings in communication or bugs in the interaction, especially if server and game are subject to different release cycles. Another problem is the complexity of testing such changes. It cannot be automated and even manual testing is only possible to a very limited extent.
I hope I could make it a bit clear why bugs happen and will continue to happen. The alternative is to stop making any changes until the software slowly dies (yes, software ages, I plan to write a blog article on this topic to explain it in more detail).
So the crucial question is not how to prevent errors, but how best to handle them. And here in this specific case we as developers and especially I as the responsible councilor have failed on several levels in my opinion:
From my point of view it's okay to make room in the forum or elsewhere to escalate the issue, as long as you do so in a respectful and non-personal way. Alternatively, a clarifying conversation with me would have been more effective. That's what I am here for... well, also... somehow.
To cut a long story short: We have understood your distress and a solution proposal is now available that still needs to be evaluated (testing is difficult again, so it will probably be a shot in the dark...). Hopefully it can be rolled out with the planned server update in 3 days.
After a talk with Terrorblade we decided to do a prerequisite check, as some potential users are very disappointed when they learn after registration that they actually can't play on FAF.
This is the maximum set of messages you can receive:
The registration was also cleaned up. Terms of service and data privacy are now hidden behind a button and will show as a scrollable in-page popup.
Now you get a nice popup confirmation after registration with the ability to resend the email.
Last but not least the API error messages are finally translated into proper messages (as you can see in the background)
During testing I also noticed that the "username reservation" logic has been broken for years in the API (= if you rename, your previous name is reserved for you exclusively for up to 6 months unless you rename again). That was fixed right away.
The client setup section already looks up the latest release on GitHub:
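To spell out the rule, here is a small Python sketch of how I read it. It is not the API's actual code: the function name, the history format and the 183-day approximation of "6 months" are assumptions, and the check whether somebody currently uses the name is left out on purpose.

```python
from datetime import datetime, timedelta

RESERVATION_PERIOD = timedelta(days=183)  # assumption: "6 months" rounded to days


def is_name_available(name, requester_id, name_history, now):
    """name_history: list of (user_id, old_name, changed_at), oldest first."""
    latest_change_by_user = {}
    for user_id, old_name, changed_at in name_history:
        # A later rename replaces the earlier one, which ends that reservation.
        latest_change_by_user[user_id] = (old_name, changed_at)
    for user_id, (old_name, changed_at) in latest_change_by_user.items():
        if old_name != name:
            continue
        if user_id == requester_id:
            continue  # the previous owner may take the name back
        if now - changed_at < RESERVATION_PERIOD:
            return False
    return True


history = [
    (1, "Alice", datetime(2021, 1, 1)),
    (2, "Bob", datetime(2021, 1, 1)),
    (2, "Bobby", datetime(2021, 2, 1)),  # user 2 renamed again
]
now = datetime(2021, 3, 1)
print(is_name_available("Alice", 7, history, now))  # reserved for user 1
print(is_name_available("Bob", 7, history, now))    # freed by the second rename
```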
The other content is still mostly missing.
The most crucial part to be done now is the link to Steam. The user needs to log in first and then be redirected to Steam. The explanation of how to set your profile to public and the error handling if linking fails matter just as much, as this led to many, many support requests in the past.
So stay tuned for the next update.
Excuse me? What the fuck am I reading here?
Yes, the vault (map & mod) could use some love, but there are quite a few reasons why this is difficult. Nevertheless, your attitude is astonishing:
There is a matching quote that comes to my mind:
And so, my fellow Commanders: ask not what your community can do for you — ask what you can do for your community.
-- Commander John F. Kennedy
Posts like yours make me question my commitment to FAF.
The fix was deployed today. Should be back up
Small recap of today's update:
Due to the ZFS reconfiguration we were now able to set up auto-snapshots of the whole system. So there are now dedicated internal backups not just of the db but of everything. We still need to improve external backups (right now only the MySQL database is subject to backup).
Shame on me, just fixed the date. It will be on Friday.
The new pool didn't go active. Looks like there was some miscommunication with Archsimkat on how to maintain the google doc for the pool list.
Should be fixed now.
The next server update has been scheduled for 13.02.2021 between 8:30 and 10:30 GMT (=UTC).
Single player or multi player? How do you launch the game?
Sounds like you start the game outside of the FAF client and without the patches.
Uninstall the client. Manually delete all remains of the folder (usually C:\Program Files\Downlords Faf Client), then reinstall.
Sounds cool, give it a try
Not possible from the replay stream.
The replay stream only contains the game orders. All of the actual gamestats are only available while the game runs through the simulation.
From my perspective tutorials are technically almost the same as coop missions, but they are restricted to 1 player.
The current coop missions are here: https://github.com/FAForever/faf-coop-maps
If somebody is willing to make some cool tutorial scenarios we'll find a way to integrate them into FAF.