Should I stay or should I Go?
-
Around October 2023 hostile ex-community members launched the first wave of DDoS attacks. The FAF infrastructure and application landscape was not prepared for that. Services connected directly to our main servers on many open ports, and our APIs were openly accessible for the benefit of the community. People could run their own IRC bots. People built API scrapers and analytics tools. All of that had to vanish basically overnight.
In a rush the FAF team closed the open ends as well as it could. We migrated the lobby connection from raw TCP to WebSockets and put the formerly open API behind authentication. We replaced the IRC server with an implementation that supports HTTP-based access. And then we hid almost all services behind Cloudflare proxy servers, except for the server itself, which was still reachable from the internet.
The weak point remained our ICE servers, that is, our coturn servers. So we started paying an external TURN-as-a-service provider and added more infrastructure around it. But the feedback on connectivity was never good. Things never got back to where they were. I started digging into the ice adapter more than anyone before. Documented features. Tried to refactor it. Tried to rewrite it. We even tried to integrate Cloudflare's new TURN service when it went live. The connectivity was horrible, and the usage-based bill according to Cloudflare would have ruined FAF financially within a month, so we had to disable it without knowing what actually happened there. At some point the DDoS more or less stopped, things settled back towards normal, and the topic of ICE faded into background noise.
For over two weeks now the DDoS has been back, with cyber-terrorist-like demands (either we abide by the attackers' terms or the DDoS continues forever). But the FAF team's stance here is clear: We do not negotiate with terrorists.
Instead we continued hardening our servers. Our main server is no longer reachable directly from the internet; traffic now has to pass multiple firewalls. Yet the bottleneck once again is the ICE connectivity.
So with the accumulated knowledge of the last years, we investigated and analysed the shit out of the ICE adapter with more tooling and (semi-)reproducible test methods. The results were not promising, in several respects:
- Our Hetzner cloud servers have huge packet loss on ICE-related communication even outside of DDoS (tested on fresh VMs). We don’t know why, but it seems that Hetzner really doesn’t like this kind of traffic. Potential solutions: (a) report the packet loss to Hetzner in a structured way, (b) use ports that are meant for other traffic and thus more stable, (c) use a different provider
- The „coturn“ software spewed errors non-stop but was completely useless at logging why these errors occurred. So we tried out a different software called „eturnal“ (love the pun here), which gave us better hints about the problems
- A Wireshark capture of a user trying to connect to Cloudflare showed us a single successful connection attempt followed by 80000 (!!) failed connection attempts within a 5-minute interval.
So apart from the Hetzner issues, we could boil it down to problems in the ice adapter. The ice adapter at its core is built around the „ice4j“ library. This is a piece of software that was originally built for the Jitsi phone software (even though it was renamed a few times and by now is a commercial service). The only maintainers are Jitsi developers, and as such the focus lies on the features Jitsi needs. There is a component called Jitsi Videobridge that is also open source. When we looked into it, we saw that Jitsi is not using TURN at all, and therefore TURN is not a big priority in ice4j. The code of ice4j has no documentation outside of regular Javadocs. And it looks like it was written in a C programming style from the 90s (while the Jitsi components are written in modern Kotlin). The worst part, however, is that it cannot be controlled or configured from the calling code.
This would not be a problem if the library did what it should. But as far as we can see by now, the TURN code of ice4j does not behave as it should. Whether it violates the specifications is beyond my understanding. One example we could identify is that ice4j tries to establish a TURN session for every IP address it can find: the external IP (that one makes sense), but also all internal network IPs (nope, that does not make sense!). And this in particular is the reason why the ice adapter causes an endless stream of errors in the coturn log: the attempt to establish a TURN connection for a private network address causes an authentication error… (ok - both software stacks here behave like idiots). That might also be the reason why there are so many Cloudflare login attempts. We don’t know.
What we do know is that we can rewrite the ice adapter all we want. As long as we choose Java or Kotlin we are bound to ice4j as it is the only notable library for ICE.
What we also know is that ICE is a low-level protocol used by WebRTC, which in turn is used by every single browser and every single audio/video conferencing tool that runs in a browser. WebRTC is everywhere, and its „data channel“ feature offers things we have wished for in the ice adapter for a long time (guaranteed and ordered delivery of packets, keep-alive functionality). So why go with ICE alone when we can have WebRTC with ICE?
Now, there is a thriving project for WebRTC called Pion with 14k stars on GitHub (ice4j: 500), around 200 contributors (ice4j: 25) and lots and lots of example code. So where is the catch? The catch is: It is written in Go. And we have no Go developers at FAF - I have never used it so far.
So what do you think: Should I stay (on ice4j) or should I Go (learning Go)?
-
Thanks as always for the very interesting write up.
It seems to me that ice4j is a dead end, but are you actually willing to learn Go?
Even if everyone on the forum (aka roughly 3% of FAF players) agrees that it's the way to go, nobody can force you to learn a new language that you might or might not be interested in.
But it does sound like it might be the solution we (desperately?) need. Or perhaps there is some other option?
As an aside, your title made me really worried for a hot minute!!
-
Although your statement is pretty clear:
@Brutus5000 said in Should I stay or should I Go?:
What we do know is that we can rewrite the ice adapter all we want. As long as we choose Java or Kotlin we are bound to ice4j as it is the only notable library for ICE.
you are very likely the one who knows best how much time it will take to learn Go, and what we can expect until we're "Go ready" (asking you to fix the next DDoS wave while you're also learning Go feels evil). The call should be yours, as the time is yours too.
I will cheer for you from behind the screen whatever the choice.
-
I’ll let better-qualified people weigh in, but in the meantime just a big thank you to Brutus and all who are donating their free time to protect FAF from these imbeciles.
-
So the question is basically a well-documented, well-maintained library in Go versus a much less documented and less maintained library in Java.
Go doesn’t have high requirements. It’s designed to be relatively easy to learn, provided the developer has experience with another high-level language. So the Go option sounds better, especially long-term.
-
I also think that working in a high-quality ecosystem is way less hassle than trying to work around a somewhat broken and abandoned library, even if it means learning a new language. And Go is not niche. If we ask around we might even find someone in our community who is experienced with Go and whom you can ask Go-specific questions.
-
Had me panicking with that title!
Agreed with the others: if you're willing to try and learn Go, or there's someone in the community who has used it, then that sounds like the best route.
-
Did you enjoy learning Kotlin? I've got almost no experience with golang, but I do like that if it shows no errors in the IDE it will generally work.
If you use Terraform in your real-world job it's also handy, since all the providers are in golang, which makes troubleshooting easier (which is why I started trying to learn it).
If you do decide to try a prototype I'd be interested in learning more.
-
First of all, let me just give praise to how witty the title was. I feel that deserves recognition!
For all the uninitiated: https://www.youtube.com/watch?v=BN1WwnEDWAM
If you give it a listen, you'll notice that the lyrics kinda describe what the DevOps team has been up to since the DDoS issues started back in 2023. Anyways, as the song says, if you Go, there will be trouble, but if you stay, it will be double.
If you are indeed interested in learning Go, I think that would be amazing. I'm also going to presume that you yourself see this as an opportunity for yourself to use FAF as a practice target in your path to learn Go for your own personal fulfillment. And I myself support this 100% - that is to say, you should Go with that. Of course, if I'm mistaken in my analysis on motivations, do correct me, but I'm positive there's some good motivation behind it, otherwise you wouldn't bother with your detailed analysis of the problem, and this amazingly succinct brief!
As always, thank you personally but also all the other DevOps boys for all the work you do. Some might see your work as bringing FAF down to its knees, but what you are really doing is keeping us up on our knees in front of the chopping block.
And to end this on a musical note, this one's title should serve as my viewpoint both on what you should do and on what my qualifications are on the topic: https://www.youtube.com/watch?v=GwpCb0qW-6Y
-
If WebRTC fits, move forward with it. People will be more accepting of the disconnects knowing there is a path forward.
Let us know what help is needed. Your leadership is appreciated.
-
What would be the problem with sending a few requests to freelancers, with a description of the problem and where the journey should Go, and negotiating a fixed price via sites such as Fiverr?
If you have an offer for what it should cost, you could start a fundraising campaign to finance the whole thing.
Learning from 0 is honorable (Go), but time is a factor and the current connection problems are really bad.
You can then have this explained to you in the process and also have a much higher learning curve so that you can continue to work on it yourself if necessary.
-
@maudlin27 said in Should I stay or should I Go?:
Had me panicking with that title!
Almost had a fucking heart attack. This is evil.
Go sounds like it may be easier, though only you can really decide if you have the time/motivation/available energy to learn Go. I don't know enough programming to make recommendations, but something that is much more supported and popular probably will make a lot of things easier. More likely to stumble upon someone with the knowledge to help with our unique environment too.
-
That title scared me.
Thank you for all the work.
On the question, I will defer to those with more knowledge on this issue.