Unexpected chain of events: How a user's CPU load issue paralysed the server


Today two thousand players experienced once again how fragile our whole server setup sometimes is, and how an unlikely chain of events can cause major issues. In this post I want to shed some light on how the chain started moving.

Start of the journey: The chat user list

We are currently investigating CPU load issues in the client, which we could track down to updates of the chat user list in #aeolus. If you open the chat tab you see a partial list of users, as the whole list doesn't fit on the screen.
We are using a ListView component from our UI framework for this. Each row or ListItem (a user or a group name) is one element in this list view. However, it would be very memory-hungry if the application rendered a list item for each chat user (~2000 at peak times) when you can only see a few dozen of them.
Therefore the list view component only holds a pool of list items that you can actually see and re-uses / redraws / repopulates them as the list changes.
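The pooling idea can be sketched in a few lines of plain Java. Note that the names (`VirtualizedList`, `Cell`) and the implementation are illustrative only, not the actual client code:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of list virtualization: only `visibleRows` cell objects ever
 *  exist, and scrolling just rebinds them to different users instead of
 *  creating one cell per user. */
class VirtualizedList {
    static class Cell { String boundUser; }

    private final List<Cell> pool = new ArrayList<>();
    private final List<String> users;
    private final int visibleRows;

    VirtualizedList(List<String> users, int visibleRows) {
        this.users = users;
        this.visibleRows = visibleRows;
        // The pool is allocated once and never grows with the user count.
        for (int i = 0; i < visibleRows; i++) pool.add(new Cell());
    }

    /** Rebind the pooled cells to the window starting at firstVisibleIndex. */
    void scrollTo(int firstVisibleIndex) {
        for (int i = 0; i < visibleRows; i++) {
            int idx = firstVisibleIndex + i;
            pool.get(i).boundUser = idx < users.size() ? users.get(idx) : null;
        }
    }

    int cellCount() { return pool.size(); }

    String userAt(int cellIndex) { return pool.get(cellIndex).boundUser; }
}
```

The flip side of this design is that every change to the backing list forces the visible cells to be redrawn, which is exactly why mass updates to the user list hurt so much.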
When profiling the application I could observe hundreds of thousands of updates to the list over a period of 10-15 minutes with about 2000 users logged in to the chat. Something was causing massive changes to the user list and we couldn't find the reason.

Mass changes out of nowhere

On a different topic, we noticed an increasing number of errors when trying to log in to IRC at peak times, also for no obvious reason. Sometimes it took multiple minutes to log in.

But then somebody using a regular IRC client looked into the logs and found something interesting: a lot of people were logging in and out within a very short period with the quit message "Max SendQ exceeded". A little bit of googling revealed that this error is thrown when the IRC server sends data to a client faster than the client can consume it.

Using the incredibly long chat log history of our moderators, we were able to track these errors down to their first occurrences in October 2019, when we roughly breached 1900 users online at once. Since then FAF has kept growing (up to 2200 users online at peak), and so did the problem.

So there is a correlation between the SendQ exceeded error and the amount of relogins/logouts. Now we tried to connect the dots:

  • Whenever you join a channel you receive the list of users in this channel.
  • If this list goes beyond a certain size the error pops up.
  • The error caused lots of logins/logouts.
  • Each of these events caused the chat user list to update multiple times.

So the idea was: if we could solve the IRC error, we might solve or at least reduce the users' CPU load. A potential solution was to increase the buffer size before the server throws the error.
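For illustration, the relevant knob in a typical ircd config looks roughly like this. This is a hypothetical UnrealIRCd-style class block; the actual daemon we run, its block names, and the values may all differ:

```
/* hypothetical class block; sendq is the per-client output buffer
   whose overflow produces "Max SendQ exceeded" */
class clients
{
    pingfreq 90;
    sendq 3000000;   /* raised so the user-list burst of a 2000-user channel fits */
    recvq 8000;
}
```

The trade-off of raising sendq is memory: the server may buffer that many bytes per slow client.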

The IRC restart of Doom

The situation was perfect: 2000 users online. A configuration change. All users would automatically reconnect. We could live-test whether the change would make the error go away or persist.

So we flipped the switch and restarted the IRC server.

We could observe the IRC numbers rising from 0 to 1400, and then the climb abruptly stopped. 600 users lost. What happened?

We expected the IRC reconnect to be simple and smooth. But when did anything ever go smoothly? A look into the server console showed all cores at 100% load, most of it caused by the API and our MySQL database. WTF happened?

The cache that cached too late

The FAF client has a few bad habits. One of them is that it downloads information about all clans of all people online in the chat. This is a simplified approach to reduce some complexity.

Since we only have a few dozen active clans, that should only be a few dozen simple calls, which would already be in the cache anyway. Or so we thought.

As it turns out, the caching didn't work as intended. The idea behind the caching is the following: you want to look up a clan XYZ. First you look into the cache, and if you don't find it there, you ask our API. Asking the API takes a few dozen milliseconds.
But what happens if you have 10 players from the same clan? You look it up in the cache, and if you don't find it there, you query it once, right? Well, that's what we expected. But since the first API lookup wasn't finished yet when the clan was looked up for the 2nd to 10th time, each of these lookups also asked the API.
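This race (a classic "cache stampede") can be closed with an atomic compute-if-absent, as a plain-Java sketch with made-up names shows. This is not the actual client code, just the shape of the problem:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of why an atomic "compute if absent" avoids the stampede:
 *  with a plain get()-then-put(), ten concurrent lookups of clan "XYZ"
 *  can each miss the cache and each call the API; computeIfAbsent()
 *  runs the loader at most once per key while other callers wait. */
class ClanCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    final AtomicInteger apiCalls = new AtomicInteger();  // counts simulated API hits

    private String fetchFromApi(String tag) {  // stands in for the slow API call
        apiCalls.incrementAndGet();
        return "clan-data-for-" + tag;
    }

    /** Synchronized lookup: concurrent callers for the same key block
     *  until the first loader finishes, so the API is hit only once. */
    String lookup(String tag) {
        return cache.computeIfAbsent(tag, this::fetchFromApi);
    }
}
```

With the naive check-then-fetch version, the number of API calls depends on thread timing; with `computeIfAbsent` it is exactly one per clan tag, no matter how many players share it.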

Now imagine 2000 users querying the clans of every player with a clan tag (~200?). Boom. Our API and our database are not meant to handle such workloads. They didn't crash, but they slowed down, and they took so much CPU while doing this that there wasn't much left for the other services. It took the server around 20 minutes to serve all requests and go back to normal.

Once we spotted that bug, it was an easy fix. As it turns out, we just needed to add a sync=true in particular cases. Problem solved. Why oh why is this not the default? Well, as it turns out, this option was only added in a later version of our Spring framework, and due to backwards compatibility it will always remain opt-in. Ouchy.
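In Spring terms the fix looks roughly like this. The service, method, and cache name below are hypothetical, and the snippet needs a Spring context with caching enabled to actually do anything; it only illustrates where sync=true goes:

```java
import org.springframework.cache.annotation.Cacheable;

public class ClanService {

    // sync = true (available since Spring Framework 4.3) makes concurrent
    // cache misses for the same key wait for the first loader to finish,
    // instead of each firing its own API request. It is opt-in for
    // backwards-compatibility reasons.
    @Cacheable(value = "clans", sync = true)
    public String getClan(String tag) {
        return callApiForClan(tag);  // slow call, now executed at most once per tag
    }

    private String callApiForClan(String tag) {
        // placeholder for the real API request
        return "clan-data-for-" + tag;
    }
}
```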

Lessons learned

We saw it again: IRC sucks (well, we all knew that before, right?). But it seems like the IRC configuration change fixed the SendQ exceeded error. Yay, we might actually reduce the CPU usage for our players.
We also now know that synchronized caching should be the default, at least in the FAF client.
Unfortunately, it was revealed again that the FAF API can't handle ultra-high workloads. Limiting the amount of available CPUs in Docker does not work in our setup, so further solutions need to be evaluated (e.g. circuit breakers).

Recap and next steps

So now we know that restarting the IRC server can be a dangerous thing. Would I do it again? Hell yes. I know it sucks for the playerbase when the server lags massively because it ran into its CPU limit. But today we learned a lot about what our software architecture is capable of and where our weaknesses are.

We can use this to improve it. And we need to. FAF is growing more than I ever expected. In the last 12 months our average online playerbase grew by ~300 users, with the latest peak being 2200 users. When we moved to the new server at the beginning of the year, I thought it could handle around 3000 users, but as it turns out there is more to do.

"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of as the familiar."
– Benno Rice

Thanks for being so committed to tracking down the issues.

interesting. nice find!

but it feels like especially from the java client side that the technologies that it is using were not properly selected. it does unnecessary things that the programmers have no control over. neither simple, nor fast. working with pre-made packages glued together.

as for the server side, i was like wtf when i saw that the API can receive database-query-like parameters, which are a big feature and convenient to use, i guess. but do you guys really need this? the most simple solution would be to write a simple database query in the DATABASE and an API that calls that one query. the further the query is from the database, the more control you lose and the less chance you have to operate the database efficiently.
you cannot really prepare for the queries this way, for example. there are queries that need up-to-date data, there are others that only need data that is a few hours fresh. you can store the latter in memory in the database if you know them, so you just serve the already-queried data.
just ideas, i know it is not a paid project but it is something to learn from...

https://api.faforever.com/data/player?filter=(login=='#name')&include=ladder1v1Rating
@.@

anyways a Firefly quote:
Simon: You had the Alliance on you, criminals and savages… half the people on this ship have been shot or wounded, including yourself, and you’re harboring known fugitives.
Mal: We’re still flying.
Simon: That’s not much.
Mal: It’s enough.

: D

nice ghetto streams: https://www.twitch.tv/foodlfg

@foodlfg said in Unexpected chain of events: How a user's CPU load issue paralysed the server:

interesting. nice find!

but it feels like especially from the java client side that the technologies that it is using were not properly selected. it does unnecessary things that the programmers have no control over. neither simple, nor fast. working with pre-made packages glued together.

I can't defend each and every choice made in the Java client, but I can tell you that where we had a choice, we used battle-proven standard libraries trusted by thousands of developers. Your statement is more of a general critique of the Spring framework with all its "magic" involved. It can be quite difficult, but that doesn't mean you don't have control over it. The general problem is complexity, as Spring components try to cover everybody's use case and can be configured in dozens of ways.

However, some libraries turned out to be a real stick in the mud. But you can't always know that beforehand. JavaFX, for example, is a pretty solid library in general. But using the JFoenix library for "nicer" UI controls was a bad choice, as its developers stopped updating it for newer versions. Or the current IRC library: it served us well for 5 years, but seems to cause problems (not sure yet) once we hit 2k+ users in a single channel (who would have known FAF would ever reach those numbers?).

If you wrote a new client from scratch in whatever language, you would face the same problems again. Which libraries are stable and well maintained for the next 5 years? Should I use the trendy library or the stable one? What if the trendy one supersedes the stable one?

as for the server side, i was like wtf when i saw that the API can receive database query like parameters which are a big feature and convenient to use, i guess. but do you guys really need this?

Do we really need this? Yes, we do. The library that we use basically gives us permission-based proxy access to the FAF db. This includes:

  • We can expose everything that we want (not all tables of the FAF db are public but all that are used).
  • We can put security filters on what we need to protect. (e.g. some fields or some tables are only visible to privileged people)
  • We can also create and update stuff (also permission based)
  • And some things come for free: sorting, filtering, including related entities

We had a different API before, where we wrote individual queries for each use case, and it was a huge amount of work even for the initial use cases back then. Do you want to rewrite the sorting and pagination rules for each endpoint on your own? The include concept would also be lost, as it is very generic; without it you would need to make dozens of API calls instead.
By now the whole moderation client is built around this API. The Java client uses it extensively. The website uses it (for the leaderboards and the league things), and even 3rd-party tools make use of it. So I think the whole API concept is a huge success.

the most simple solution would be to write a simple database query in the DATABASE and an API that calls that one query. the further the query is from the database, the more control you lose and the less chance you have to operate the database efficiently.

But it's not one query. It's hundreds of queries. I don't want to write and test these manually when there is a library that can build almost optimal SQL queries, as long as you don't misuse it.

you cannot really prepare for the queries this way, for example. there are queries that need up-to-date data, there are others that only need data that is a few hours fresh. you can store the latter in memory in the database if you know them, so you just serve the already-queried data.
just ideas, i know it is not a paid project but it is something to learn from...

That is indeed the downside of it, yes. Flexibility comes at a price. We need to use it with care, and in the long run we need to think about somehow cancelling queries that are too expensive.

Eventually it always comes down to this: we are a very small core team, and there is rarely more than 1 core developer per service. Thus we need to make the right abstractions to keep the code size small enough to handle. If I wrote hundreds of queries manually to squeeze out maximum performance, I could no longer make changes to the database without fixing dozens of them (apart from the fact that I couldn't do anything else for FAF, as I'd be busy just with that). We made a trade-off here and chose flexibility over performance.

"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of as the familiar."
– Benno Rice