Today two thousand players experienced once again how fragile our whole server setup can be, and how an unlikely chain of events can cause major issues. In this post I want to shed some light on how the chain started moving.
Start of the journey: The chat user list
We are currently investigating CPU load issues in the client, which we were able to track down to updates of the chat user list in #aeolus. If you open the chat tab you only see a partial list of users, as the whole list doesn't fit on the screen.
We are using a ListView component from our UI framework for this. Each row or ListItem (a user or a group name) is one element in this list view. However, it would be very memory-hungry if the application rendered a list item for each chat user (~2,000 at peak times) when you can only see a few dozen of them. Therefore the list view component only holds a small pool of list items, roughly as many as are actually visible, and re-uses / redraws / repopulates them as the list changes.
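The recycling idea can be sketched in plain Java. This is a simplified, hypothetical model for illustration only, not the actual framework code; the class name, method, and pool size are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of list virtualization: only enough "cells" for the
// visible viewport ever exist; scrolling re-binds them to different users.
class VirtualListSketch {
    static List<String> renderViewport(List<String> users, int firstVisible, int poolSize) {
        List<String> cells = new ArrayList<>();
        // Only poolSize cells are ever created, no matter how long `users` is.
        for (int i = 0; i < poolSize && firstVisible + i < users.size(); i++) {
            cells.add("cell" + i + " -> " + users.get(firstVisible + i));
        }
        return cells;
    }
}
```

Even with 2,000 users online, the number of rendered rows stays bounded by the pool size; what changes on every list update is which user each pooled cell displays.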
While profiling the application I observed hundreds of thousands of updates to the list over a period of 10-15 minutes with about 2,000 users logged in to the chat. Something was causing massive changes to the user list, and we couldn't find a reason.
Mass changes out of nowhere
On a different front we also noticed an increasing number of errors when trying to log in to IRC at peak times, for no obvious reason. Sometimes logging in took multiple minutes.
But then somebody using a regular IRC client looked into the logs and found an interesting thing: a lot of people were logging in and out within a very short period with the quit message "Max SendQ exceeded". A little bit of googling revealed that this error is thrown when the IRC server sends data to a client and the client doesn't read it fast enough.
Using the incredibly long chat log history of our moderators we were able to track these errors down to their first occurrences in October 2019, when we first passed roughly 1,900 users online at once. Since then FAF has kept growing (up to 2,200 users online at peak), and the problem grew with it.
So there was a clear correlation between the SendQ exceeded error and the number of relogins/logouts. Now we tried to connect the dots:
- Whenever you join a channel you receive the list of users in this channel.
- If this list exceeds a certain size, the SendQ error pops up and the server disconnects the client.
- Every disconnected client reconnected, causing lots of logins/logouts.
- Each of these events caused the chat user list to update multiple times.
So the idea was: if we could solve the IRC error, we might also solve, or at least reduce, the users' CPU load. A potential solution was to increase the server's send buffer size so the limit would no longer be hit.
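For illustration, here is a toy model of how such a per-client send queue behaves. This is not the actual ircd code; the class, the byte limit, and the method names are invented for the example:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of an IRC server's per-client send queue: outgoing lines are
// buffered, and if the client reads too slowly the buffer grows past the
// limit and the client is dropped ("Max SendQ exceeded").
class SendQueueSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private long queuedBytes = 0;
    private final long maxSendQ;
    boolean connected = true;

    SendQueueSketch(long maxSendQ) { this.maxSendQ = maxSendQ; }

    // Called whenever the server wants to send a line to this client.
    void enqueue(String line) {
        if (!connected) return;
        queue.add(line);
        queuedBytes += line.length();
        if (queuedBytes > maxSendQ) {
            connected = false; // the client is disconnected
            queue.clear();
            queuedBytes = 0;
        }
    }

    // Called when the client's socket is ready to accept more data.
    void drain(long bytes) {
        while (!queue.isEmpty() && queue.peek().length() <= bytes) {
            String line = queue.poll();
            bytes -= line.length();
            queuedBytes -= line.length();
        }
    }
}
```

A burst like a channel's full user list can overflow a small buffer before the client gets a chance to drain it, which is why raising the limit was a plausible fix.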
The IRC restart of Doom
The situation was perfect: 2,000 users online. A configuration change. All users would automatically reconnect. We could live-test whether the change would make the error go away or whether it would persist.
So we flipped the switch and restarted the IRC server.
We watched the IRC numbers rise from 0 to 1,400, and then they abruptly stopped climbing. 600 users lost. What happened?
We expected the IRC reconnect to be simple and smooth. But when did anything ever go smoothly? A look into the server console showed all cores at 100% load, most of the utilization coming from the API and our MySQL database. WTF happened?
The cache that cached too late
The FAF client has a few bad habits. One of them is that it downloads information about the clans of all people online in the chat. This is a simplified approach that keeps the client logic less complex.
Since we only have a few dozen active clans, that should only amount to a few dozen simple calls, which would/should be served from the cache already. Or so we thought.
As it turns out, the caching didn't work as intended. The idea behind caching is the following: you want to look up a clan XYZ. First you look into the cache, and if you don't find it there, you ask our API. Asking the API takes a few dozen milliseconds.
But what happens if you have 10 players from the same clan? You look it up in the cache, and if you don't find it there, you query it once, right? Well, that's what we expected. But since the first API lookup hadn't finished by the time the 2nd to 10th lookups started, each of these lookups also asked the API.
Now imagine 2,000 users querying the clans for every player with a clan tag (~200?). Boom. Our API and our database are not meant to handle such workloads. They didn't crash, but they slowed down, and they ate so much CPU that there wasn't much left for the other services. It took the server roundabout 20 minutes to serve all requests and go back to normal.
Once we spotted that bug, it was an easy fix. As it turns out we just needed to add a sync=true in particular cases. Problem solved. Why oh why is this not the default? Well, as it turns out this option was only added in a later version of our Spring framework, and due to backwards compatibility it will always remain opt-in. Ouchy.
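In plain Java the difference boils down to check-then-fetch versus an atomic computeIfAbsent. This is a simplified model of what the synchronized cache mode does, not the client's actual code; the class and the API-call counter are made up for the example:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the clan cache. The naive version lets every
// concurrent caller miss the cache and hit the API; the synchronized
// version lets exactly one caller fetch while the others wait.
class ClanCacheSketch {
    final Map<String, String> cache = new ConcurrentHashMap<>();
    final AtomicInteger apiCalls = new AtomicInteger();

    private String fetchFromApi(String tag) {
        apiCalls.incrementAndGet();
        try { Thread.sleep(50); } catch (InterruptedException ignored) {} // simulate API latency
        return "Clan " + tag;
    }

    // Naive: a second lookup arriving during the 50 ms fetch misses too.
    String lookupNaive(String tag) {
        String clan = cache.get(tag);
        if (clan == null) {
            clan = fetchFromApi(tag);
            cache.put(tag, clan);
        }
        return clan;
    }

    // Synchronized: computeIfAbsent runs the fetch at most once per key,
    // while concurrent callers for the same key block until it is done.
    String lookupSync(String tag) {
        return cache.computeIfAbsent(tag, this::fetchFromApi);
    }
}
```

With 10 players of the same clan looked up concurrently, the naive version fires multiple API requests for the same key, while the synchronized version fires exactly one. That is the whole difference between "a few dozen calls" and the stampede we saw.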
We saw it again: IRC sucks (well, we all knew that before, right?). But it seems like the IRC configuration change fixed the SendQ exceeded error. Yay, we might actually have reduced the CPU usage for our players.
Also, we now know that synchronized caching should be the default, at least in the FAF client.
Unfortunately it was revealed once again that the FAF API can't handle ultra-high workloads. Also, limiting the amount of available CPUs in Docker does not work in our setup. Further solutions need to be evaluated (e.g. circuit breakers).
Recap and next steps
So now we know that restarting the IRC server can be a dangerous thing. Would I do it again? Hell yes. I know it sucks for the playerbase when the server lags massively because it is running into CPU limits. But today we learned a lot about what our software architecture is capable of and where our weaknesses lie.
We can use this to improve it. And we need to. FAF is growing faster than I ever expected. In the last 12 months our average online playerbase grew by ~300 users, with the latest peak at 2,200 users. When we moved to the new server at the beginning of the year I thought it could handle around 3,000 users, but as it turns out there is more work to do.