Replay vault download

How would one go about downloading the entire replay vault and replay vault database / search tools? What sort of size is it?

It is very slow for me to navigate, and a local copy may be much faster

The full set of replays takes around 500 GB (downloading all of it doesn't really make sense), and it is not available as a single download.
The search tools are not available for offline use either.

"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of is the familiar."
– Benno Rice

How are replays searched? Can metadata about the game be easily extracted? Is that metadata currently extracted and put into a database? I am interested in improving this.

@nooby said in Replay vault download:

It is very slow for me to navigate

What does very slow mean and what are you trying to do? For the normal replay searches (searching for a player, searching for a map, etc.) it always felt fast enough for me.

@blackyps What client are you using? I have not had that experience: for me the search hangs and needs a reFAF to fix. I am searching by player, rating, map, and number of players with the advanced search options.

Typically the latest client, but it shouldn't matter too much, because I don't think there were any client changes to the searches recently. Are you using Player Name "contains" instead of "is"? I just tested different stuff and it was all reasonably fast except that one. The database of all replays is really huge and checking for substrings in the players is really complicated. So if you know the player name you can use "is" and massively speed up the search. If you use the filter search, it already does this for you.
If you absolutely have to run a complicated query you can limit the time range of the replays to speed it up. You are probably not that interested in games a long time ago anyway.
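To illustrate why "is" tends to be so much faster than "contains", here is a minimal sketch using SQLite as a stand-in. The real FAF database, its schema, and its indices are not shown in this thread, so the table and column names below are hypothetical. The point is only that an exact match can use an index, while a substring match has to look at every row.

```python
import sqlite3

# In-memory stand-in for the real database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (id INTEGER PRIMARY KEY, login TEXT)")
conn.execute("CREATE INDEX idx_player_login ON player (login)")


def query_plan(sql: str, params: tuple) -> str:
    """Return SQLite's query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan text


# "is": an exact match can be answered by a lookup in the index.
exact_plan = query_plan("SELECT id FROM player WHERE login = ?", ("Nooby",))

# "contains": a pattern with a leading wildcard forces a check of every row.
contains_plan = query_plan("SELECT id FROM player WHERE login LIKE ?", ("%oob%",))

print("is:      ", exact_plan)
print("contains:", contains_plan)
```

The first plan should report a SEARCH using the index and the second a SCAN. Other database engines word their plans differently, but the same distinction usually applies.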

@blackyps Ah, I was using map "contains" and player "contains", so that must be why. I need to optimise the query.

I would still like a way to download every replay with:
say at least two players
at least one player over 1000 rating
no astro craters

for archival purposes.
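As a rough sketch, the criteria above could be written as a filter over per-replay metadata. The dict keys (`num_players`, `ratings`, `map_name`), the substring test for "astro", and the sample values are all assumptions made for illustration. The thread does not say where player ratings would come from.

```python
def worth_archiving(meta: dict) -> bool:
    """Apply the three criteria from the post above to one replay's metadata."""
    enough_players = meta.get("num_players", 0) >= 2
    has_strong_player = any(rating > 1000 for rating in meta.get("ratings", []))
    # Assumption: astro crater maps can be recognised by "astro" in the map name.
    is_astro = "astro" in meta.get("map_name", "").lower()
    return enough_players and has_strong_player and not is_astro


# Invented sample metadata, not taken from the real vault.
examples = [
    {"num_players": 8, "ratings": [1300, 900], "map_name": "Seton's Clutch"},
    {"num_players": 1, "ratings": [1500], "map_name": "Sandbox"},
    {"num_players": 4, "ratings": [1200, 1100], "map_name": "Astro Crater Battles 4x4"},
]

kept = [m["map_name"] for m in examples if worth_archiving(m)]
print(kept)
```

Only the first example passes all three checks.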

Would it be possible to "cull" the replay vault? Replays are small, but I would wager games with one player in sandbox mode could stand to be removed. Perhaps a separate long term archive with things over a few years old would be helpful.

You must deceive the enemy, sometimes your allies, but you must always deceive yourself!

We discussed that a few times. It's technically very challenging for no real benefit. So while it is technically possible, it won't happen.

I wouldn't call it very challenging, it's just some work. We have enough space on the disk to last us a few more years, so it's not an urgent issue.

https://replay.faforever.com/15487505

So one could run an incremental wget script against https://replay.faforever.com over a month or so to download them all, rate limited so that it doesn't amount to a DDoS.
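A minimal sketch of such an incremental, rate-limited download in Python, assuming replays are served at `https://replay.faforever.com/<id>` as in the link above. The function names, the `.fafreplay` file extension, the 2-second delay, and the treatment of any HTTP error as "no replay" are my own assumptions, not anything the server documents.

```python
import time
import urllib.error
import urllib.request
from pathlib import Path
from typing import Callable, List, Optional

BASE_URL = "https://replay.faforever.com"  # direct-link pattern quoted in this thread


def replay_url(replay_id: int) -> str:
    return f"{BASE_URL}/{replay_id}"


def fetch(url: str) -> Optional[bytes]:
    """Download one replay; return None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError:
        return None


def download_range(
    first_id: int,
    last_id: int,
    out_dir: str,
    delay_seconds: float = 2.0,  # assumed polite delay, tune as needed
    fetch: Callable[[str], Optional[bytes]] = fetch,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Save replays first_id..last_id into out_dir and return the ids saved."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for replay_id in range(first_id, last_id + 1):
        target = target_dir / f"{replay_id}.fafreplay"
        if target.exists():
            continue  # incremental: skip what an earlier run already saved
        data = fetch(replay_url(replay_id))
        if data is not None:
            target.write_bytes(data)
            saved.append(replay_id)
        sleep(delay_seconds)  # rate limit: pause after every request
    return saved
```

Calling `download_range(15487505, 15487510, "replays")` would fetch six ids and skip any that are already on disk, so the script can be stopped and restarted without starting over.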

header
{"uid": 15487505, "complete": true, "state": "PLAYING", "featured_mod": "faf", "game_type": "0", "recorder": "Kekomander", "host": "Kekomander", "launched_at": 1633885469.0, "game_end": 1633887433.0, "title": "1.3k pain", "mapname": "scmp_009", "num_players": 8, "teams": {"3": ["ZmeiGorinich", "Kekomander", "PhantomSamurai", "Greedyscoobs"], "2": ["JT_", "AlphaNoob", "DEVOTION", "Nooby"]}, "featured_mod_versions": {"1": 3724, "2": 3724, "3": 3634, "4": 3709, "5": 1, "6": 3724, "8": 1, "9": 1, "11": 3724, "12": 3724, "13": 3724, "14": 3724, "15": 3724, "17": 3677, "18": 3724, "19": 3724, "20": 3724, "21": 3724, "22": 3724}, "version": 2, "compression": "zstd"}

So just parse the header line into your PostgreSQL database and you're good: you've got your own offline searchable vault.
How does it figure out the victory condition?
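Here is a rough sketch of the header-parsing idea, assuming the JSON header sits on the first line of the replay file. It uses SQLite from the Python standard library as a stand-in for PostgreSQL, the table layout is hypothetical, and the sample is a shortened copy of the header quoted above. Note that this header does not contain the victory condition.

```python
import json
import sqlite3


def read_header(raw: bytes) -> dict:
    """Parse the JSON header that FAF puts on the first line of a replay."""
    first_line = raw.split(b"\n", 1)[0]
    return json.loads(first_line)


def index_header(conn: sqlite3.Connection, header: dict) -> None:
    """Store the searchable fields of one header (hypothetical table layout)."""
    players = [name for team in header["teams"].values() for name in team]
    duration = header["game_end"] - header["launched_at"]  # seconds
    conn.execute(
        "INSERT OR REPLACE INTO replay VALUES (?, ?, ?, ?, ?, ?)",
        (header["uid"], header["title"], header["mapname"],
         header["num_players"], duration, ",".join(players)),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE replay (uid INTEGER PRIMARY KEY, title TEXT, mapname TEXT, "
    "num_players INTEGER, duration_seconds REAL, players TEXT)"
)

# Shortened copy of the header from this thread, followed by placeholder body bytes.
sample = (
    b'{"uid": 15487505, "title": "1.3k pain", "mapname": "scmp_009", '
    b'"num_players": 8, "launched_at": 1633885469.0, "game_end": 1633887433.0, '
    b'"teams": {"3": ["ZmeiGorinich", "Kekomander"], "2": ["JT_", "Nooby"]}}\n'
    b"<compressed replay data>"
)

header = read_header(sample)
index_header(conn, header)

# Example search: games on scmp_009 that ran longer than 30 minutes.
rows = conn.execute(
    "SELECT uid, duration_seconds FROM replay "
    "WHERE mapname = ? AND duration_seconds > ?",
    ("scmp_009", 1800),
).fetchall()
print(rows)
```

Because `launched_at` and `game_end` are both in the header, an elapsed-time search comes almost for free with this approach.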

Suggestions for improvement to search terms

search by elapsed game time
search by map size
have a saveable search profile for advanced search

There are actually 2 different headers. The one added by FAF that you see there, and the original one included in the replay data. The original gpg header will have pretty much all metadata that you could want about the replay including game options, players, mods, etc. To get it though you’ll have to use a parser that knows the binary format of the header. I’ve written one and made a post about it here: https://forum.faforever.com/topic/1551/faf-scfa-replay-parser-library but there are a number of different implementations out there in a variety of languages.

@askaholic Thank you, this is relevant and cool, and something I am going to have a play with.

@mazornoob said in Replay vault download:

I wouldn't call it very challenging, it's just some work. We have enough space on the disk to last us a few more years, so it's not an urgent issue.

I do - in terms of FAF complexity that is what I call very challenging. It's not just moving around files. Just a few things that come up:

  • Identifying proper criteria (it must catch enough replays to make a difference, but not remove any "important" cases). There are a few dozen different opinions and once you settle with a common understanding you've got to check that your database is holding that data consistently (which it usually doesn't) + filtering them in a way that it doesn't overload the server
  • What do you do with table entries? You can't delete them, because foreign key constraints don't allow that (reviews, moderation reports, ...). Maybe you can partition them, but again you're playing with fire on a live system in the 2 biggest tables you have. Then you need to make use of indices in the client so that the partitioning actually has an effect. Or you add some more flags or whatever
  • In case you move them elsewhere, make sure they are available for download there. Suddenly you need to resolve URLs by business logic (right now it's just a redirect into the weird folder structure)
  • How do you mount additional/external storage on the server?

It's a problem that covers almost all areas of FAF: database, API, client, server structure. These are the worst.

Picking good criteria is hard, sure, but for the database we have an 'is replay available' flag, don't we? We can just flip it and that's it, database entries can stay as they are. There's also no "move elsewhere" question if we choose to delete them 🙂
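To make the foreign key point and the flag idea concrete, here is a small sketch using SQLite. The table and column names, including `replay_available`, are hypothetical stand-ins, since the real schema is not shown in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Hypothetical schema: a game row that a review row points at.
conn.execute(
    "CREATE TABLE game (id INTEGER PRIMARY KEY, "
    "replay_available INTEGER NOT NULL DEFAULT 1)"
)
conn.execute(
    "CREATE TABLE review (id INTEGER PRIMARY KEY, "
    "game_id INTEGER NOT NULL REFERENCES game (id), text TEXT)"
)
conn.execute("INSERT INTO game (id) VALUES (15487505)")
conn.execute("INSERT INTO review (game_id, text) VALUES (15487505, 'good game')")

# Deleting the game row is rejected because a review still references it.
try:
    conn.execute("DELETE FROM game WHERE id = 15487505")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True

# Flipping the flag leaves the row, and everything pointing at it, untouched.
conn.execute("UPDATE game SET replay_available = 0 WHERE id = 15487505")
flag = conn.execute(
    "SELECT replay_available FROM game WHERE id = 15487505"
).fetchone()[0]

print("delete blocked:", delete_blocked, "| replay_available:", flag)
```

The replay file itself could then be removed from disk separately, while reviews and reports keep pointing at a valid row.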

The point in discussion was "Perhaps a separate long term archive with things over a few years old would be helpful."
Just dropping them was heavily opposed when discussed a few months ago.

Alright, I misunderstood what the "that" was, sorry.

The replay parser that @askaholic linked can extract a bunch of useful information that could be added to the database and used in filter queries, for example whether a replay desynced and, if so, at what exact time.

Also, for moderation, it could automatically extract the chat of every replay into another database.

It can also be done on demand with the OG web based parser https://fafafaf.github.io/. So for moderation purposes there isn’t a need to extract everything ahead of time. That would mostly be a ton of data that nobody looks at.

I have a Python 3 wget script that grabs the direct link and saves the replays incrementally. It is rate limited to avoid a DDoS. If anyone would like it, here it is.

replaygrabber.zip