Greetings, fellow members of the FAF community,
As you are probably aware, many users have experienced connectivity problems over the past couple of weeks. These were, and still are, caused by issues that were diagnosed only recently and that made connections unstable for many FAF users. We will go over the timeline of events and outline what is required to get things back to normal.
First and foremost, we apologize for the lack of communication concerning these issues - the cause of the problems was initially unknown, and we didn't want to share information we weren't sure was correct. We hope that this rundown will give all users a good overview of what was, and still is, causing them.
The Disk Issues
The first problem detected and tackled by the DevOps team was the failure of a physical disk on the server hosting FAF services, which occurred on the 9th of April. This was a serious issue, so the members with the know-how took immediate action, and services returned to normal functionality within an hour of the problem being detected.
A month later, on the 6th of May, another disk failed. This time it was the main disk, the one needed to restart the services. Due to the type of failure, remote access was unavailable at the time team members were on hand. The schedules of the DevOps team and of the service provider made it hard to estimate how long it would take to get the servers back into working order. Service was restored on the 7th of May, less than 8 hours after the failure was detected.
In light of the outages, the DevOps team decided to use this chance to carry out a scheduled migration of service hosting onto a newer, substantially better server. To be exact, the new main server has double the performance of the old one, which will make all non-game services run much more smoothly. The migration was completed on the 28th of May, with some residual problems that were resolved quickly afterwards.
Upgrading the main server did not, however, help with the connection instability experienced by many. Some users were concerned that the issues were caused either by the game update released on the 20th of May or by the most recent client update. We assure you this is not the case, and we hope the next game update, client update or balance patch will be awaited with anticipation rather than dread. So what is the cause? Now that we know with certainty, let us explain.
In-game connection problems
To understand why the connection issues appeared, we need a short crash course on how FAF servers work. First, there is the "main" server, the one we mentioned was migrated to a new machine and which is currently in pristine state. This server lets players connect to all non-game services - the forums, the FAF website and the client. However, once a game has started, the main server no longer has anything to do with it. For example, if you were already in a game and the main server crashed, the game would continue normally.
When a game is started, the last thing the main server does is establish connections between the players. First, it tries to set up peer-to-peer connections - direct connections between players that allow the most straightforward exchange of data. Sometimes, however, this type of connection cannot be established, and that is where another type of server steps in to help - the coturn servers. These servers effectively act as a middleman, receiving data from one player and sending it on to another. Most of the unstable connections players experienced recently had to do with problems on these servers.
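For readers who prefer code, the "direct first, relay as fallback" idea described above can be sketched roughly as follows. This is only an illustration in Python and not FAF's actual connection code: the names `Route`, `choose_route`, `try_direct` and `relay_is_healthy` are made up for this example, and the real negotiation between players is considerably more involved.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Route:
    kind: str            # "direct" (peer to peer) or "relay" (via a coturn-style server)
    via: Optional[str]   # name of the relay used, or None for a direct route


def choose_route(
    try_direct: Callable[[], bool],
    relays: Sequence[str],
    relay_is_healthy: Callable[[str], bool],
) -> Optional[Route]:
    """Pick a route between two players (hypothetical helper, for illustration only)."""
    # Step 1: prefer a direct peer-to-peer connection when it can be established.
    if try_direct():
        return Route("direct", None)
    # Step 2: otherwise fall back to the first relay that is currently usable.
    for relay in relays:
        if relay_is_healthy(relay):
            return Route("relay", relay)
    # No direct path and no usable relay: the two players cannot be connected.
    return None
```

In this simplified picture, a relay that is overloaded only matters to the pairs of players who could not connect directly, which is why some players were affected by the recent problems while others were not.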
Resolving the issues
Originally, there was only one coturn server, but recently four more were added. The DevOps team hoped that adding coturn servers would fix the connection instability. The good news is that the additional servers might improve connection health in general for future games, but this in itself did not fix the problems people were having. After some investigation, the DevOps team concluded that the issues with both the old and the new coturn servers were caused by unwanted elevated traffic on these servers. In terms some people might be familiar with, the coturn servers are being DoS-ed - not an unheard-of occurrence in the business of server hosting, but one that FAF had managed to avoid until recently.
There are established methods for fixing and preventing these kinds of problems. The obstacle the DevOps team currently faces in implementing them is the amount of work required - they are unable to start until their personal schedules allow for it. This means that, unfortunately, we are currently unable to give an estimate of when these issues will be resolved. We would like to use this opportunity to ask our fellow members of FAF for patience, and for help - if you have experience with Java/Kotlin and server maintenance, please contact us, either via the forums or the Discord server.