If you've played a game over the past 2 years, you know it. That little notification that pops up when you join a game, telling you that something called the "FAF Ice Adapter" has been started (a notification which is going to vanish in the next release). But what the heck even is that, and how does it help my game run without disconnects?
Warning: This post is quite long and goes into some detail on how the internet works as well as on the inner workings of FA's simulation and networking. It is also meant as an in-depth explanation and reference that I was asked to write after having to explain ICE to quite a lot of people, including developers, so I won't skip the ICE-related details.
I tried to explain them as well as possible, but feel free to skip especially the first two sections if you wanna get to the point quicker. I provide a TL;DR whenever possible.
FA's simulation model
First of all, let us dive a bit into the way the game operates.
When you launch a game of FA, you do not connect to a server. There is no server. Back when the game was built, internet connections weren't capable of handling thousands of units per player, so the game had to be designed differently. Instead of a central server maintaining and managing the entire game state, every game client knows the complete state of the game. This avoids having to send information about what's going on in the game and synchronizing the positions and states of all units and projectiles over the network. Instead, the only information clients need to exchange are the player inputs. The entire simulation is built deterministically, so that given the state of the game as well as all player inputs, every player's computer arrives at the same result after calculating one tick of the game.
This removes the need for a (costly) server, but it also comes with some drawbacks: it off-loads the CPU load onto the clients, therefore slowing the game down to the simulation speed of the slowest client, and it requires that all player inputs are present when the next tick is to be processed. If a player isn't finished calculating the previous tick/step, or their inputs were dropped somewhere, the entire game will lock up and you'll get either stuttering or even a connection issue / "player quiet since" dialog. (This is a so-called "lockstep model".)
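If you're a programmer, a heavily simplified sketch of the lockstep idea might make this clearer. Nothing here is actual FA engine code; the class, the tick rate, the 5-tick delay constant and the helper methods are all made up for illustration:

```java
// Minimal lockstep sketch (illustrative only, not actual FA engine code).
// Every peer runs this same loop; determinism guarantees identical results everywhere.
import java.util.List;

class LockstepLoop {
    static final int ORDER_DELAY_TICKS = 5; // e.g. 500 ms at an assumed 10 ticks/s

    void run(GameState state, OrderSource local, OrderSource[] peers) {
        for (int tick = 0; ; tick++) {
            // Send our orders for a *future* tick so they have time to reach everyone.
            local.sendOrdersFor(tick + ORDER_DELAY_TICKS);

            // Block until the orders of EVERY peer for this tick are present.
            // If one connection stalls, this wait stalls, and the whole game stutters.
            List<Order> orders = waitForAllOrders(tick, local, peers);

            // Deterministic step: same inputs + same state = same result on every machine.
            state.advanceOneTick(orders);
        }
    }

    List<Order> waitForAllOrders(int tick, OrderSource local, OrderSource[] peers) {
        // ... blocks until local + all remote inputs for `tick` have arrived
        throw new UnsupportedOperationException("sketch only");
    }

    interface OrderSource { void sendOrdersFor(int tick); }
    interface Order {}
    interface GameState { void advanceOneTick(List<Order> orders); }
}
```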
TL;DR: FA has no server, each player runs the entire game equally
FA's network model
As there is no server, players somehow need to exchange messages to coordinate the game and to send their inputs to every other player each tick. The game therefore uses a peer-to-peer architecture. This means that each client in the game is connected directly to each other client.
To allow for network congestion and long distances, there needs to be a gap between the point in time when a command from one player is sent and the point at which it is processed. When you issue a movement order, it is in fact only executed 500 ms later (it could also be 250, I'm not completely sure), even though the UI layer of the game conveniently hides that from you by starting to rotate the unit immediately. That rotation is purely visual and not present in the actual game simulation.
TL;DR: The takeaway here is that each command is delayed by 500 ms no matter what, so as long as your round-trip latency (RTT) to each other peer stays below that threshold, the game will run just fine. If it rises above 500 ms, even just 550 for one other player, the entire game will start stuttering. This needs to hold for every connection between all players. If one connection breaks, the entire game will lock up.
Running multiplayer games
A 2v2 only contains twice as many players as a 1v1. A 4v4 doesn't sound too bad, given that it's only 8 players. But when you look at the connections that have to be maintained in the background, the story is quite different. As there has to be a connection between each pair of players, the number of connections actually scales quadratically with the number of players:
| Players | Connections |
| ------- | ----------- |
| 2       | 1           |
| 4       | 6           |
| 8       | 28          |
| 12      | 66          |
| 16      | 120         |
Remember, even one of those connections breaking means the entire game locks up.
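If you want to check the numbers yourself: every pair of players needs exactly one connection, so for n players that's n * (n - 1) / 2. A tiny snippet (plain Java, nothing FAF-specific) reproduces the table:

```java
// Each pair of players needs exactly one connection: n * (n - 1) / 2.
public class ConnectionCount {
    static int connections(int players) {
        return players * (players - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 4, 8, 12, 16}) {
            System.out.println(n + " players -> " + connections(n) + " connections");
        }
    }
}
```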
The ghosts of older times - IPv4 and NAT
Now let's get into the actual networking. The Internet works using addresses. Each device has an address, an Internet Protocol (IP) address. If you know the address, you can talk to it.
When packet switching was first designed, it wasn't conceivable that one day there would be more IP-capable devices than humans. 32 bits (0s and 1s) per address seemed enough; you know them as the 0.0.0.0 - 255.255.255.255 notation. That's 4,294,967,296 addresses, of which about 86 % are available for public usage.
Right now, each US household owns on average around 10 internet-connected devices. There are more end user devices connected to the Internet than humans on this spaceship, not even taking into account business devices, public infrastructure, the servers of service providers on the internet, and the infrastructure of the internet itself. All of which need IP addresses.
Put simply, we're completely out of IPv4 addresses. This problem was discovered early, so a replacement called IPv6 was designed. (The protocol identifier for IPv4 was 4, and 5 was already in use for ST, so there never was an IPv5.) IPv6 uses 128 bits for addresses. That's quite a few more addresses: 340,282,366,920,938,463,463,374,607,431,768,211,456 to be exact. We could give each grain of sand on this planet an IPv6 address and still be left with 99.9999 % of the addresses.
The big issue: all the existing infrastructure uses IPv4, and the transition to IPv6 will take some time. (here's a map)
TL;DR: We don't have enough IPv4 addresses for everybody.
NAT
(important for ICE)
In the meantime, we have to get tricky to save IP addresses. The major building block for doing that is Network Address Translation (NAT). Instead of giving every device an address, we separate the network into lots of local sub-networks and give each of them one address. So your router gets assigned a public IPv4 address, and all devices in your household use that same address to communicate. Each device gets a private address that is only valid within your local home network (192.168.X.X, 10.X.X.X, …).
Each time you want to open a website, your device sends a request to your local gateway (router), which takes your local network address and replaces it with its global address. The only issue with this: when it receives an answer from the website addressed to your public address, it needs to figure out which local device to send that packet to. It does so by assigning a random port X when it sends the packet from the public address and remembering that mapping. When the answer comes back on port X, it knows exactly which private/local address to translate it to.
(Image: NAT concept diagram, from https://commons.wikimedia.org/wiki/File:NAT_Concept-en.svg)
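Conceptually, the router keeps nothing more than a translation table. Here's a toy model of that idea (all names and addresses are made up for illustration; this is obviously not real router code):

```java
// Toy model of a NAT translation table (illustration only).
import java.util.HashMap;
import java.util.Map;

public class ToyNat {
    record Endpoint(String ip, int port) {}

    private final String publicIp = "203.0.113.7";                 // the router's single public address
    private final Map<Integer, Endpoint> mappings = new HashMap<>(); // public port -> private endpoint
    private int nextPort = 40000;

    // Outgoing packet: remember which private device used which public port.
    Endpoint translateOutgoing(Endpoint privateSource) {
        int publicPort = nextPort++;
        mappings.put(publicPort, privateSource);
        return new Endpoint(publicIp, publicPort);
    }

    // Incoming packet: look up which private device the public port belongs to.
    // If there is no mapping (nobody sent anything out first), the packet is dropped.
    Endpoint translateIncoming(int publicPort) {
        return mappings.get(publicPort);
    }
}
```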
Sidenote: This prevents you from hosting a webserver, as no one can reach your device without you connecting to them first. You need to tell your router what to do with the packets it receives, which is exactly what port forwarding does. ("If you get a packet on your public address addressed to port 80, it should go to PC-X in the local network.")
NAT Traversal
This is fine if your users "behind NAT" only access remote services but do not receive incoming connections. But if you dream of establishing a connection between two people who are both behind a NAT, which is exactly what a peer-to-peer application like FAF needs to do (28 times for a 4v4), you simply cannot. The routers on both ends just won't know what to do with the packets they get.
To get around this, NAT traversal methods have been developed. One of them is hole punching. Hole punching works by both clients simply sending each other packets on a fixed port, thereby convincing their NATs that there was an outgoing request, which tells them what to do with the incoming packets. It requires a third-party server that is publicly reachable and tells both clients to start transmitting to each other.
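To make the idea concrete, here's a minimal hole-punching sketch with plain UDP sockets. The port and the peer's public address are placeholders that a real application would get from the signaling server:

```java
// UDP hole punching sketch (placeholder addresses; a real setup needs a signaling
// server to exchange the public endpoints first).
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;

public class HolePunch {
    public static void main(String[] args) throws Exception {
        DatagramSocket socket = new DatagramSocket(6112);                         // fixed local port
        InetSocketAddress peer = new InetSocketAddress("198.51.100.23", 6112);    // peer's public endpoint

        byte[] hello = "hello".getBytes();
        // Sending outbound first makes our NAT create a mapping ("punches the hole"),
        // so the peer's packets towards that mapping are no longer dropped.
        for (int i = 0; i < 10; i++) {
            socket.send(new DatagramPacket(hello, hello.length, peer));
            Thread.sleep(200);
        }

        // If the peer does the same towards us, one of their packets will get through.
        byte[] buf = new byte[1500];
        DatagramPacket incoming = new DatagramPacket(buf, buf.length);
        socket.receive(incoming); // blocks until the hole is punched on both sides
        System.out.println("Got packet from " + incoming.getSocketAddress());
    }
}
```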
All of that assumes, though, that your router actually has a public IPv4 address. Oftentimes this just isn't the case, e.g. with Dual-Stack Lite. There, you actually only have an IPv6 connection and IPv4 traffic gets tunneled via that. Your ISP (internet service provider) will then often use something called a Carrier-Grade NAT (CGNAT) / symmetric NAT, where multiple customers of the same ISP get put behind a NAT with one or even multiple public IP addresses.
So your requests might use different public IP addresses shared with hundreds of other customers. Hole punching is impossible in that situation.
TL;DR: NAT traversal and hole punching can get you through NAT, but only in some situations.
FA Networking
To recap: we need to establish a connection between all players, all of whom are located behind a NAT and cannot be reached at a public IP address. If the connection between any two players breaks, we cannot run the game.
The game by itself already does some kind of NAT traversal: it can use STUN. It's not very reliable though, most of the time it only works for simple host-to-host connections, and it rolls the dice on being able to hole punch.
This meant that most players had to resort to port forwarding, i.e. telling their router via a setting which incoming packets to send to their machine (which is impossible with CGNAT).
Legacy FAF Networking
To mitigate this, FAF implemented a proxy server that allowed players to establish a connection even if they were unable to get a direct one. On client startup, the server would test whether hole punching succeeds and, if not, establish a proxy connection for the game to use.
This way, players with a forwarded port or successful hole punching got a direct connection; everybody else was still able to connect to other players using the proxy. Somewhat, sometimes, if they got lucky…
A tale of REFAFing
New players might not even know this word anymore. REFAFing was an essential part of FAF culture for many years. It was used to tell someone trying to connect to a game that they should rejoin. And not just rejoin the game, but restart the entire FAF client.
It was pretty common for players not to get a connection to others in the lobby; almost every game involved certain players not being able to connect to each other and having to REFAF multiple times before the game had even started.
If you managed to start the game, then in a substantial number of matches at least two players would lose connection to each other during the game (locking up the entire game) while still being able to chat with everyone else, trying to figure out which of the two players would leave so the game could continue. (An especially bad debate if the players happened to be on opposing teams.)
Side note: But what the heck does REFAFing actually do? The connectivity type was determined by the test run on client launch. Subsequently, all connections in all games and to all players would use that same connection type. If the server could reach you directly, but you didn't manage to get a direct connection to another player, you were out of luck, as you weren't using the proxy. REFAFing reruns the connectivity test, hoping for a different result. (There are likely many more issues at play here, but that's one of the main reasons.)
So, after "fixing connections" turned out to be the top contender in a survey about what FAF users wanted, it was decided that ICE, being a new, standardized protocol for establishing connections, was the future of FAF connectivity. And so began the 2.5-year development cycle that successively yielded 5 different, almost-but-never-quite-completed ice adapters.
The ICE Adapter
The FAF ice adapter aims to manage all connections for the game. It's a program running on your local system, started/stopped by the FAF client, that offers an interface to the game. The game is then told by the FAF server/client/adapter to connect to other players, supplying it with a fake IP address. If you join a FAF game nowadays, your game will think all of the peers it's connected to are located on your local machine. It sends all its traffic to the ice adapter, which then figures out how to forward it to the other players' ice adapters, which in turn forward it to their games.
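Put differently, the adapter acts as a small UDP middleman on localhost. A heavily simplified sketch of that forwarding idea (not the adapter's actual code; the port and the transport interface are stand-ins for whatever ICE negotiated):

```java
// Simplified idea of the ice adapter's local forwarding (not the real implementation).
// The game sends to 127.0.0.1:<fakePort>; whatever arrives there is pushed through
// the ICE-negotiated transport to the remote adapter, which feeds it to the remote game.
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.util.Arrays;

public class LocalForwarder {
    interface PeerTransport {            // stand-in for the ICE connection to one peer
        void sendToPeer(byte[] data);
    }

    static void forwardGameTraffic(int fakeLocalPort, PeerTransport peer) throws Exception {
        try (DatagramSocket fromGame = new DatagramSocket(fakeLocalPort)) {
            byte[] buf = new byte[1500];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                fromGame.receive(packet);                                 // the game thinks the peer is local
                peer.sendToPeer(Arrays.copyOf(buf, packet.getLength()));  // off it goes via ICE
            }
        }
    }
}
```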
The ICE adapter uses Interactive Connectivity Establishment (ICE, standardized in RFC 5245) to, well, interactively establish connections.
How does ICE compare to the old solution
ICE actually doesn't change that much compared to the old solution. It still uses direct connections via hole punching (or a forwarded port, but DON'T do that) and it still uses a relay in case the direct connection doesn't work. The difference is that ICE is an industry standard that streamlines everything and is way, way more intelligent.
- ICE reruns the connectivity establishment interactively for each peer; you can use different methods (local, direct, relay) for different players
- ICE monitors the connection and reconnects via relay when the direct connection breaks
- ICE monitors the connection and survives a change in IP address: if your power goes out, connecting your laptop to your phone's hotspot will reconnect you
- ICE supports local connections: if you're in the same network as another player, you'll get a LAN connection (previously you had to set and forward different ports for each player and then still got a connection via the internet)
- The ICE adapter can pick different relays depending on your location
- Relaying is done using the TURN protocol, which is also standardized and probably more reliable
DO NOT PORT FORWARD WITH ICE
You no longer need to forward any ports for ICE, and you won't improve anything by doing so.
TL;DR: DO NOT PORT FORWARD FOR FAF ANYMORE
How does ICE work
You've made it this far, awesome! Welcome to the interesting part: how the heck does ICE actually work?
For each player you connect to, the ice adapter will run ICE to establish a connection. The process can be broken down into the following steps:
- Gather candidates
- Send candidates to the other player's adapter via the FAF server
- Receive candidates from the other player's adapter via the FAF server
- Form pairs of candidates (one local candidate, one remote candidate for the other player)
- Try to reach the other player on all pairs (using the local address, sending to the remote candidate)
- Pick the best pair that succeeded, nominate it, confirm (if all pairs failed, disable the non-relay candidates and go back to step 1)
- Start communication
- (Monitor the connection with echo requests once per second, restart on connection loss)
The following types of candidates exist (ignore the last one):
- HOST: a candidate obtained from a local interface - this is an IP address someone in your own network can reach you at (for local connections)
- SRFLX: server reflexive - a candidate obtained by asking a STUN server "where do you see this request coming from?" - usually your "public IP"
- RELAY: a relay candidate - this is basically a TURN server lending you its public IP address and port, which you can use by sending your data there through a tunnel
- (PRFLX: peer reflexive - a candidate obtained by asking another player you're already connected to where they see the request coming from - allows connections e.g. within autonomous systems or other WANs without going over the public internet)
Side note: STUN - Session Traversal Utilities for NAT, TURN - Traversal Using Relays around NAT
So in step 1, your adapter will gather all possible candidates, e.g. (a small sketch of this gathering step follows the list):
- host 192.168.0.10:6120 (your local IPv4)
- host [fe80::2d5d:1a01:9e2b:4ac1]:6121 (your local IPv6)
- srflx 1.2.3.4:6122 (your public IP)
- relay 116.202.155.226:12345 (faforever.com relay)
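For the curious, here's a rough sketch of how the HOST candidates could be gathered with plain Java. The SRFLX and RELAY entries would come from STUN and TURN requests, which are only hinted at in comments; none of this is the adapter's actual code:

```java
// Gathering HOST candidates from the local network interfaces (sketch only).
// SRFLX would come from a STUN "where do you see me?" request, RELAY from a TURN
// allocation on a relay server; both are omitted here.
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class GatherCandidates {
    record Candidate(String type, String address, int port) {}

    public static void main(String[] args) throws Exception {
        List<Candidate> candidates = new ArrayList<>();
        int port = 6120; // illustrative base port

        for (NetworkInterface iface : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            if (!iface.isUp() || iface.isLoopback()) continue;
            for (InetAddress addr : Collections.list(iface.getInetAddresses())) {
                candidates.add(new Candidate("host", addr.getHostAddress(), port++));
            }
        }
        // candidates.add(new Candidate("srflx", ...));   // public IP learned via STUN
        // candidates.add(new Candidate("relay", ...));   // relayed address allocated via TURN

        candidates.forEach(c -> System.out.println(c.type() + " " + c.address() + ":" + c.port()));
    }
}
```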
It will then send those to the other peer and receive their candidate list in return (steps 2 and 3).
It will then open a ton of sockets and start talking to the other side (steps 4-7). When it reaches the other player, it will attempt to communicate and then establish the connection on the preferred pair. An example list of pairs (a simplified check loop is sketched after it):
- host <-> host
- host <-> host
- srflx <-> host
- host <-> srflx
- relay <-> srflx
- relay <-> relay
A relay connection will ALWAYS succeed, therefore in theory the adapter should always be able to connect.
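And here's the promised stripped-down version of the check phase (steps 4-7): sort the candidate pairs by preference, probe each one, and nominate the first that answers. The real implementation runs checks in parallel with STUN binding requests and retransmissions; all names below are made up:

```java
// Stripped-down candidate pair checking (illustration of steps 4-7, not real adapter code).
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class PairChecker {
    record Pair(InetSocketAddress local, InetSocketAddress remote, int priority) {}

    // Try every pair in priority order; a relay pair should always succeed eventually.
    static Optional<Pair> selectPair(List<Pair> pairs) {
        return pairs.stream()
                .sorted(Comparator.comparingInt(Pair::priority).reversed())
                .filter(PairChecker::probe)
                .findFirst();                      // "nominate" the best working pair
    }

    // Send a probe from the local candidate to the remote candidate and wait for an echo.
    static boolean probe(Pair pair) {
        try (DatagramSocket socket = new DatagramSocket(pair.local())) {
            byte[] ping = "check".getBytes();
            socket.setSoTimeout(500);
            socket.send(new DatagramPacket(ping, ping.length, pair.remote()));
            socket.receive(new DatagramPacket(new byte[64], 64));
            return true;                           // remote side answered on this pair
        } catch (SocketTimeoutException timeout) {
            return false;                          // this pair doesn't work, try the next
        } catch (Exception e) {
            return false;
        }
    }
}
```

In the real adapter this is of course driven by the ICE state machine rather than a simple loop, but the preference ordering (host before srflx before relay) follows the same idea.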
Side note for developers: One adapter is in offerer mode, the other in answerer mode. The game host is always the offerer. The offerer sends its candidates first, decides on the chosen candidate pair, and monitors the connection.
Development
In theory, all of this sounds quite simple. I wrote 95 % of the current ICE adapter's code within a single day. The real issue is debugging, testing, and figuring out what's wrong.
Every machine is different. Every operating system is different. Every network setup is different. Every ISP is different.
There were five attempts at building an ice adapter for FAF. The second adapter was written in nodeJS using WebRTC (by muellni, before I even joined the ICE project). Even after quite a lot of testing, we couldn't figure out why its CPU load was unbearable.
Therefore the next adapter was written in C++ using a native binding of WebRTC. Debugging that took quite some time. That was when I joined the project, building a test client and server that simulated a running FA game for the adapter, took measurements, and shipped logs to the server automatically. From that point on we did weekly ice adapter tests with a dozen volunteers (I want to thank everyone who helped out on that here; if you didn't get the ICE avatar yet, message a moderator), trying to figure out what was wrong while muellni and vk tried to fix the adapter each week.
After multiple months of that (during which I also wrote a 4th adapter as an exercise using the ice4j library), we finally arrived at a point where we had a stable, lightweight, reliable adapter ready for deployment. A deployment date was scheduled (September 2018), and we were running the final tests.
Then the day came when someone told us their game was lagging. After some investigation, we figured out that their upload bandwidth was too low. But it had worked before. As it turned out, the WebRTC ICE implementation was encrypting all of our packets. Certain huge internet giants (cough) had decided that it's a good idea to enforce packet encryption in WebRTC (ICE itself supports multiple encryption types as well as none). In general I agree with that sentiment. For us, it meant 12 bytes of encryption header added to game data packets that are on average 15 bytes long.
That doesn't sound like much. But in practice it meant that everyone able to play 6v6 might now be stuck with 4v4, and everyone who was capable of 4v4 would now have trouble with 2v2.
That was unacceptable. So we searched for a solution, some hacky way to disable encryption. We found none.
During that investigation I had also written a 5th adapter (somehow I accidentally completely lost the code for the 4th one, even though it was synced to my cloud and committed into git; I still don't get how that happened). So, back to the start. Six more months of debugging and testing, and countless bugs, issues, and (sometimes self-resolving) problems later, we got ready for deployment.
In total, ICE development took 2.5 years, 5 adapters, a test client and server software for simulation, multiple people joining / quitting the project, weekly tests and lots of volunteers.
Side note: I'm not going to go into detail on the python client vs java client issue which was co-dependent with ICE and caused a huge disaster on the deployment weekend.
Where are we today
There's an unresolved issue with the adapter that occurs in fewer than 1 in 100 cases. It is likely due to the OS not being happy with the adapter spawning hundreds of sockets. It happens in native Windows code and I have no clue what's going on; the reason for the socket close supplied by Windows is incorrect. If you are experiencing this (lots of errors about sockets in the log), try the following in order (see also the wiki page on troubleshooting network issues): deactivate virtual network adapters, use a VPN, reinstall Windows, use Linux.
Relay servers
The ice adapter takes multiple relay servers into account.
Oceanian and especially Australian players are often put behind CGNAT by their ISPs, preventing them from establishing a direct connection. Sometimes this can be resolved by calling your ISP and asking whether it's possible to get you off CGNAT. Otherwise, a relay via coturn (a STUN+TURN server) is needed. FAF's coturn is located in Nuremberg, Germany.
So two Australian neighbors A and B living next door to each other will be relayed via Europe. If you take the round trip into account, that means: A -> Europe -> B -> Europe -> A
One such round trip is 300 ms at best, so two round trips are 600 ms, well above FAF's maximum threshold of 500 ms. This made quite a lot of players unable to play.
Therefore one community member offered to host another coturn server in Sydney (I believe), which is now taken into account by the ice adapter. This should allow (and did, while it worked) nearly all Oceanian players to play with everyone else around the world completely fine. (As I said above, as long as the latency stays below 500 ms.)
Current issue
There has been an ongoing issue with the Oceanian relay over the past months. The FAF server has been under a denial-of-service attack for quite some time, which forced the server admins to disable ICMP echo on the server. The ICE adapter, however, uses ICMP echo to determine the nearest relay server.
Therefore all users ended up using the European relay on faforever.com, causing the trouble described in the previous section. We have been trying to fix this but couldn't find a permanent solution.
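For illustration, picking the nearest relay by echo time could look roughly like this (the hostnames are placeholders, and note that Java's InetAddress.isReachable only uses real ICMP when it has the required privileges, otherwise it typically falls back to a TCP probe on port 7):

```java
// Picking the relay with the lowest echo round-trip time (sketch; hostnames are placeholders).
import java.net.InetAddress;
import java.util.Comparator;
import java.util.List;

public class RelaySelector {
    record RelayRtt(String host, long rttMillis) {}

    static RelayRtt pickNearest(List<String> relayHosts) {
        return relayHosts.stream()
                .map(RelaySelector::measure)
                .min(Comparator.comparingLong(RelayRtt::rttMillis))
                .orElseThrow();
    }

    static RelayRtt measure(String host) {
        try {
            long start = System.nanoTime();
            boolean reachable = InetAddress.getByName(host).isReachable(2000);
            long rtt = reachable ? (System.nanoTime() - start) / 1_000_000 : Long.MAX_VALUE;
            return new RelayRtt(host, rtt);   // MAX_VALUE = "no echo answer"
        } catch (Exception e) {
            return new RelayRtt(host, Long.MAX_VALUE);
        }
    }
}
```

In a sketch like this, a relay whose echo is blocked simply never wins the comparison, which matches what happened once ICMP echo was disabled server-side.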
Server restructuring
To improve the infrastructure, the server admins have been moving the coturn server (relay) from faforever.com onto multiple separate machines. This allows enabling ICMP echo again and building a more robust infrastructure. You might experience some issues in game while playing at the moment, although the ice adapter should reconnect quite quickly. Ideally, we're aiming for multiple TURN servers around the world for the best connectivity. At the moment there are 2 in Germany and 1 in Australia, and we are investigating adding one for the Americas.
PS: There is no ping in FA, and there is no ICMP echo in FA; it's called latency or round-trip time (RTT) ;). Otherwise that's like saying "I'm staying in bed today because I have a thermometer".