Around October 2023, hostile ex-community members launched the first wave of DDoS attacks. The FAF infrastructure and application landscape was not prepared for that. Services connected directly to our main server on many open ports, and our APIs were openly accessible for the benefit of the community. People could run their own IRC bots. People built API scrapers and analytics tools. All of that had to vanish basically overnight.
In a rush, the FAF team closed down the open ends as well as possible. We migrated the lobby connection from raw TCP to web sockets, and we put the formerly open API behind authentication. We replaced the IRC server with an implementation supporting HTTP-based access. And then we hid almost all services behind Cloudflare proxy servers, except for the main server itself, which was still reachable from the internet.
The weak point remained our ICE servers, aka our Coturn servers. So we started paying for an external TURN-as-a-service provider and added more infrastructure around it. But the feedback on connectivity was never good; things never got back to where they had been. I started digging into the ice adapter more than anyone before. Documented features. Tried to refactor it. Tried to rewrite it. We even tried to integrate Cloudflare's new TURN service as it went live. The connectivity was horrible, and the usage Cloudflare billed us for would have ruined FAF financially within a month, so we had to disable it, never finding out what actually happened there.
At some point the DDoS more or less stopped, things settled back towards normal, and the topic of ICE faded into background noise.
Now, for over two weeks, the DDoS has been back, along with cyber-terrorist-like demands (either we abide by the attackers' terms or the DDoS continues forever). But the FAF team's stance here is clear: we do not negotiate with terrorists.
Instead we continued hardening our servers. Our main server is no longer reachable directly from the internet; all traffic has to pass multiple firewalls. Yet the bottleneck, once again, is ICE connectivity.
So, with the accumulated knowledge of the last years, we investigated and analysed the shit out of the ice adapter, with more tooling and (semi-)reproducible test methods. The results were not promising across multiple combinations we tested:
- Our Hetzner cloud servers show heavy packet loss on ICE-related communication even outside DDoS periods (tested on fresh VMs). We don’t know why, but it seems that Hetzner really doesn’t like this kind of traffic (see the probe sketch after this list).
Potential solutions: (a) report the packet loss to Hetzner in a structured way, (b) use ports that are meant for other kinds of traffic and thus more stable, (c) use a different provider.
- The coturn software spewed non-stop errors but was completely useless at logging why these errors occurred. So we tried out a different implementation called eturnal (love the pun here), which gave us a better hint about the problems.
- A Wireshark capture of a user trying to connect to Cloudflare showed us a single successful connection attempt followed by 80,000 (!!) failed connection attempts within a five-minute interval.
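If you want to reproduce the packet-loss measurement yourself, something along the lines of the following sketch is enough. This is an illustration, not our exact tooling, and the STUN host is a placeholder; it fires bare STUN binding requests (the same packet type ICE uses for its connectivity checks) at a server and counts how many get answered. A healthy path should report near-zero loss here.

```kotlin
import java.net.DatagramPacket
import java.net.DatagramSocket
import java.net.InetAddress
import java.net.SocketTimeoutException
import kotlin.random.Random

// Placeholder endpoint; point this at the server under test.
const val STUN_HOST = "stun.example.org"
const val STUN_PORT = 3478
const val ATTEMPTS = 100

fun main() {
    val target = InetAddress.getByName(STUN_HOST)
    var answered = 0
    DatagramSocket().use { socket ->
        socket.soTimeout = 1000 // one second per attempt
        repeat(ATTEMPTS) {
            // Minimal STUN binding request (RFC 5389): type 0x0001,
            // zero-length body, magic cookie, random transaction id.
            val request = ByteArray(20)
            request[0] = 0x00; request[1] = 0x01               // message type
            byteArrayOf(0x21, 0x12, 0xA4.toByte(), 0x42)
                .copyInto(request, 4)                          // magic cookie
            Random.nextBytes(request, 8, 20)                   // transaction id
            socket.send(DatagramPacket(request, request.size, target, STUN_PORT))
            try {
                socket.receive(DatagramPacket(ByteArray(512), 512))
                answered++
            } catch (e: SocketTimeoutException) {
                // no answer within the timeout: count the packet as lost
            }
        }
    }
    val lost = ATTEMPTS - answered
    println("$answered/$ATTEMPTS answered, ${lost * 100 / ATTEMPTS}% lost")
}
```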
So apart from the Hetzner issues, we could boil it down to problems in the ice adapter. The ice adapter at its core is built around the ice4j library. This is a piece of software originally built for the Jitsi phone software (even though Jitsi was renamed a few times and by now is a commercial service). The only maintainers are Jitsi developers, and as such the focus lies on the features Jitsi needs. There is a component called Jitsi Videobridge that is also open source. When we looked into it, we saw that Jitsi does not use TURN at all, so TURN is not a big priority in ice4j. The code of ice4j has no documentation beyond regular Javadocs. And it looks like it was written in a C-programming style from the 90s (while the Jitsi components are written in modern Kotlin). The worst part, however, is that it is not possible to control or configure it from the calling code.
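To give a feeling for what "not controllable from the calling code" means, this is roughly the entire surface a caller gets to use. It is a minimal sketch assuming ice4j's classic Agent/harvester entry points (exact signatures vary between versions; server addresses and credentials are placeholders). Once the harvesters are registered, there is no knob to restrict which local addresses they run over:

```kotlin
import org.ice4j.Transport
import org.ice4j.TransportAddress
import org.ice4j.ice.Agent
import org.ice4j.ice.harvest.StunCandidateHarvester
import org.ice4j.ice.harvest.TurnCandidateHarvester
import org.ice4j.security.LongTermCredential

fun main() {
    val agent = Agent()
    // Register STUN/TURN harvesters. ice4j later runs them over every
    // local address it discovers on its own; the caller cannot filter that.
    agent.addCandidateHarvester(
        StunCandidateHarvester(
            TransportAddress("stun.example.org", 3478, Transport.UDP)))
    agent.addCandidateHarvester(
        TurnCandidateHarvester(
            TransportAddress("turn.example.org", 3478, Transport.UDP),
            LongTermCredential("user", "secret")))
    // One stream with a single component is all a data connection needs.
    val stream = agent.createMediaStream("data")
    agent.createComponent(stream, Transport.UDP, 6112, 6112, 6112)
    // ... exchange the gathered candidates with the peer out of band,
    // add the remote ones, then:
    agent.startConnectivityEstablishment()
}
```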
This should not be a problem if the library does what it should. But as far as we can see by now, the TURN code of ice4j does not behave as it should. Whether it outright violates the specifications is beyond my understanding. One example we could identify: ice4j tries to establish a TURN session for every IP address it can find, i.e. the external IP (that one makes sense), but also all internal network IPs (nope, that does not make sense!). And this in particular is the reason why the ice adapter causes an endless log stream of errors in coturn: the attempt to establish a TURN connection from a private network address causes an authentication error… (OK, both software stacks behave like idiots here.) That might also be the reason for the flood of Cloudflare login attempts? We don't know.
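To illustrate why that floods coturn, here is a small sketch (plain JDK, no ice4j) that enumerates local addresses the way a host-candidate harvest does and marks which of them a TURN allocation could ever make sense for. On a typical PC, the private and virtual entries easily outnumber the single public one:

```kotlin
import java.net.NetworkInterface

fun main() {
    // The same address pool an ICE host-candidate harvest works from.
    // ice4j attempts a TURN allocation from each of these addresses,
    // although only the public ones could ever succeed.
    for (nif in NetworkInterface.getNetworkInterfaces()) {
        for (addr in nif.inetAddresses) {
            val verdict = when {
                addr.isLoopbackAddress -> "loopback, TURN attempt is pointless"
                addr.isLinkLocalAddress -> "link-local, TURN attempt is pointless"
                addr.isSiteLocalAddress -> "private, TURN attempt just fails (coturn logs an auth error)"
                else -> "public, the only address worth an allocation"
            }
            println("${nif.name} ${addr.hostAddress}: $verdict")
        }
    }
}
```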
What we do know is that we can rewrite the ice adapter all we want: as long as we choose Java or Kotlin, we are bound to ice4j, as it is the only notable ICE library on the JVM.
What we do know is that ICE is a low-level protocol used by WebRTC, which in turn is used by every single browser and every audio/video conferencing tool that runs in a browser. WebRTC is everywhere, and its "data channel" feature offers things we have wished for in the ice adapter for a long time (guaranteed, ordered delivery of packets; keep-alive functionality). So why go with ICE alone when we can have WebRTC with ICE?
Now, there is a thriving WebRTC project called Pion, with 14k stars on GitHub (ice4j: 500), around 200 contributors (ice4j: 25), and lots and lots of example code. So where is the catch? The catch is: it is written in Go. And we have no Go developers at FAF; I have never used it so far.
So what do you think: Should I stay (on ice4j) or should I Go (learning Go)?