Replay vault download

I wouldn't call it very challenging, it's just some work. We have enough space on the disk to last us a few more years, so it's not an urgent issue.

https://replay.faforever.com/15487505

so, one could run an incremental wget script over a month or so against https://replay.faforever.com to download them all, rate limited to avoid DDoSing the server
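For illustration, a minimal sketch of such an incremental, rate-limited downloader in Python (the ID range, output folder and delay are placeholder assumptions, not recommendations):

```python
import time
import urllib.request
from pathlib import Path

BASE_URL = "https://replay.faforever.com"    # vault download endpoint
OUT_DIR = Path("replays")                    # local archive folder (assumption)
START_ID, END_ID = 15487505, 15487600        # hypothetical ID range to fetch
DELAY_SECONDS = 2.0                          # crude rate limit between requests

OUT_DIR.mkdir(exist_ok=True)

for uid in range(START_ID, END_ID + 1):
    target = OUT_DIR / f"{uid}.fafreplay"
    if target.exists():                      # incremental: skip what we already have
        continue
    try:
        with urllib.request.urlopen(f"{BASE_URL}/{uid}") as resp:
            target.write_bytes(resp.read())
        print(f"saved {uid}")
    except Exception as exc:                 # some IDs have no stored replay
        print(f"skipped {uid}: {exc}")
    time.sleep(DELAY_SECONDS)                # stay polite to the server
```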

header
{"uid": 15487505, "complete": true, "state": "PLAYING", "featured_mod": "faf", "game_type": "0", "recorder": "Kekomander", "host": "Kekomander", "launched_at": 1633885469.0, "game_end": 1633887433.0, "title": "1.3k pain", "mapname": "scmp_009", "num_players": 8, "teams": {"3": ["ZmeiGorinich", "Kekomander", "PhantomSamurai", "Greedyscoobs"], "2": ["JT_", "AlphaNoob", "DEVOTION", "Nooby"]}, "featured_mod_versions": {"1": 3724, "2": 3724, "3": 3634, "4": 3709, "5": 1, "6": 3724, "8": 1, "9": 1, "11": 3724, "12": 3724, "13": 3724, "14": 3724, "15": 3724, "17": 3677, "18": 3724, "19": 3724, "20": 3724, "21": 3724, "22": 3724}, "version": 2, "compression": "zstd"}

so, just parse the header line into your PostgreSQL and you're good, you've got your own offline searchable vault
how does it figure out the victory condition?
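For the parsing step, a minimal sketch of loading those FAF headers into PostgreSQL (assuming psycopg2, a made-up table name, and that the JSON header is the first line of each downloaded .fafreplay file):

```python
import json
from pathlib import Path

import psycopg2
from psycopg2.extras import Json  # wraps a dict for insertion into a JSONB column

# Assumption: a local database called "vault" already exists
conn = psycopg2.connect(dbname="vault")
cur = conn.cursor()

# Hypothetical table for the offline vault index
cur.execute("""
    CREATE TABLE IF NOT EXISTS replay_index (
        uid          BIGINT PRIMARY KEY,
        title        TEXT,
        mapname      TEXT,
        featured_mod TEXT,
        num_players  INTEGER,
        launched_at  DOUBLE PRECISION,
        game_end     DOUBLE PRECISION,
        header       JSONB
    )
""")

for path in Path("replays").glob("*.fafreplay"):
    # The FAF-added header is the first line of the .fafreplay file
    with open(path, "rb") as f:
        header = json.loads(f.readline())
    cur.execute(
        """INSERT INTO replay_index
           (uid, title, mapname, featured_mod, num_players, launched_at, game_end, header)
           VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
           ON CONFLICT (uid) DO NOTHING""",
        (header["uid"], header.get("title"), header.get("mapname"),
         header.get("featured_mod"), header.get("num_players"),
         header.get("launched_at"), header.get("game_end"), Json(header)),
    )

conn.commit()
```

Keeping the full header in a JSONB column means anything not pulled out into its own column can still be queried later.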

Suggestions for improvement to search terms

Search by elapsed game time
Search by map size
Have a saveable search profile for advanced search

There are actually 2 different headers. The one added by FAF that you see there, and the original one included in the replay data. The original gpg header will have pretty much all metadata that you could want about the replay including game options, players, mods, etc. To get it though you’ll have to use a parser that knows the binary format of the header. I’ve written one and made a post about it here: https://forum.faforever.com/topic/1551/faf-scfa-replay-parser-library but there are a number of different implementations out there in a variety of languages.

@askaholic thank you, this is relevant and cool and something I am going to have a play with.
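Something like this is what I have in mind, a minimal sketch of using it (the package name, constructor defaults and result keys are assumptions taken from the linked post, so double-check against the library's README):

```python
# pip install faf-replay-parser   (package name and API assumed from the linked post)
from fafreplay import Parser

# Raw .scfareplay data; for a .fafreplay you would first strip the JSON header line
# and decompress the rest before handing it to the parser.
with open("15487505.scfareplay", "rb") as f:
    data = f.read()

parser = Parser()                 # defaults are enough for header-only inspection
replay = parser.parse(data)       # assumption: returns a dict with "header" and "body"

header = replay["header"]         # game options, players, mods, ...
print(sorted(header.keys()))
```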

@mazornoob said in Replay vault download:

I wouldn't call it very challenging, it's just some work. We have enough space on the disk to last us a few more years, so it's not an urgent issue.

I do - in terms of FAF complexity that is what I call very challenging. It's not just moving around files. Just a few things that come up:

  • Identifying proper criteria (it must catch enough replays to make a difference, but not remove any "important" cases). There are a few dozen different opinions, and once you settle on a common understanding you've got to check that your database is holding that data consistently (which it usually doesn't), plus filter them in a way that doesn't overload the server
  • What do you do with the table entries? You can't delete them; foreign key constraints don't allow that (reviews, moderation reports, ...). Maybe you can partition them, but again you're playing with fire on a live system in the 2 biggest tables you have. Then you need to make use of indices in the client so that the partitioning actually has an effect. Or you add some more flags or whatever
  • In case you move them elsewhere, make sure they are still available for download there. Suddenly you need to resolve URLs by business logic (right now it's just a redirect into the weird folder structure)
  • How do you mount additional/external storage on the server?

It's a problem that covers almost all areas of FAF: Database, API, Client, Server structure. These are the worst.

"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of is the familiar."
– Benno Rice

Picking good criteria is hard, sure, but for the database we have an 'is replay available' flag, don't we? We can just flip it and that's it, database entries can stay as they are. There's also no "move elsewhere" question if we choose to delete them 🙂

The point in discussion was "Perhaps a separate long term archive with things over a few years old would be helpful."
Just dropping them was heavily opposed when discussed a few months ago.

"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of is the familiar."
– Benno Rice

Alright, I misunderstood what the "that" was, sorry.

The replay parser that @askaholic linked can extract a bunch of useful information that could be used for filter queries to add to the database - for example whether a replay desynced, and if so at what exact time.

Also, for moderation, automatically extracting the chat of every replay to another database.
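A sketch of how that batch extraction could feed the database, with a deliberately hypothetical extract_info() standing in for whatever the parser actually reports about desyncs and chat (I haven't pinned down its exact output format):

```python
from pathlib import Path

import psycopg2
from psycopg2.extras import Json


def extract_info(replay_bytes):
    """Hypothetical wrapper around the replay parser: returns
    (desynced: bool, desync_tick: int | None, chat: list[str]).
    The real field names depend on the parser's output format."""
    raise NotImplementedError


conn = psycopg2.connect(dbname="vault")   # same local DB as the header index
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS replay_extras (
        uid         BIGINT PRIMARY KEY,
        desynced    BOOLEAN,
        desync_tick BIGINT,
        chat        JSONB
    )
""")

for path in Path("replays").glob("*.fafreplay"):
    uid = int(path.stem)                   # filenames like 15487505.fafreplay
    desynced, desync_tick, chat = extract_info(path.read_bytes())
    cur.execute(
        """INSERT INTO replay_extras (uid, desynced, desync_tick, chat)
           VALUES (%s, %s, %s, %s) ON CONFLICT (uid) DO NOTHING""",
        (uid, desynced, desync_tick, Json(chat)),
    )

conn.commit()
```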

It can also be done on demand with the OG web based parser https://fafafaf.github.io/. So for moderation purposes there isn’t a need to extract everything ahead of time. That would mostly be a ton of data that nobody looks at.

I have a python3 wget script that grabs the direct link and saves the replays incrementally. It is rate limited to avoid DDoSing the server; if anyone would like it, here it is.

replaygrabber.zip