In-game chat dump from 631,711 replays and 23,929 players

@femtozetta there are nearly 15 million replays in total

So this is about 1 year?

I would kindly ask you not to analyse people's chat messages and then publish names based on that.

People did not consent to this and FAF could get in trouble for such a thing.

For things like who played against whom, replays are a bad starting point. It's easier to scrape the API and put the data in a graph database (SQL is incredibly inefficient for this particular kind of query).
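
To illustrate the "who played against whom" idea, here is a minimal sketch that builds an in-memory adjacency map in plain Python. It is not a graph database and does not call the FAF API; the `games` records and the `build_opponent_graph` helper are hypothetical, and the assumed input shape is a list of teams per game, each team being a list of player IDs.

```python
from collections import defaultdict
from itertools import product


def build_opponent_graph(games):
    """Map each player ID to a dict of opponent ID -> games played against them."""
    graph = defaultdict(lambda: defaultdict(int))
    for teams in games:
        # Pair every player with every player on each other team in the game.
        for i, team_a in enumerate(teams):
            for team_b in teams[i + 1:]:
                for a, b in product(team_a, team_b):
                    graph[a][b] += 1
                    graph[b][a] += 1
    return graph


# Hypothetical records: each game is a list of teams, each team a list of player IDs.
games = [
    [[1, 2], [3, 4]],
    [[1, 3], [2, 4]],
    [[1], [3]],
]

graph = build_opponent_graph(games)
print(dict(graph[1]))  # opponents of player 1 with game counts
```

A lookup such as `graph[1][3]` then answers "how often did player 1 face player 3" directly, which is the kind of query that needs repeated self-joins in a relational schema.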

"Nerds have a really complicated relationship with change: Change is awesome when WE'RE the ones doing it. As soon as change is coming from outside of us it becomes untrustworthy and it threatens what we think of is the familiar."
– Benno Rice

@brutus5000 Aren't all the replays and their chat histories searchable by username? Like, publicly?

You must deceive the enemy, sometimes your allies, but you must always deceive yourself!

Your camera doesn't stop you from filming other people. Yet uploading such a video to YouTube is not allowed.
It's similar with IRC: you can keep a chat log of everything, but publishing it is a grey area.

Just to get this right:

  • You are allowed to analyze this data for private purposes.
  • You are allowed to share insights (e.g. see the funny cloud diagram of most used words in chat)
  • You are not allowed to analyze the behavior of individual users or produce a social rating and publish it (or basically do anything that relates back to a single user)

If I see it on the official FAF platform anywhere I will remove it and give that person a warning (or have moderation do it).
However, I cannot prevent you from publishing it elsewhere. If you do so and it falls back on us legally, in the worst case we'll have to take other measures, whatever they might be (e.g. taking down the replay vault as the final action if there is no other solution).

That's why I'm asking kindly for now. Let's just stay out of trouble so everybody can still have fun.

2 questions from an amateur:

  1. Could you upload a bigger dataset as well?
  2. Is there some web service or database or something where I can easily translate a user ID to a username?

Ban Anime

The chat logs are already published in the vault so I don't see how this is any different tbh

You don't see a difference between browsing through chat logs manually and mass-profiling single users and publishing the results?

I mean, it's just presenting public information in a more organized and readable way. It's the same difference as between looking at rating changes game by game in the replay vault and using a tool that shows each user's rating changes per day, or other similar tools, e.g. Kazbek's tool that lets you see which maps a specific user plays. Yes, in theory it can be used in a harmful way, idk, shaming someone for trash-talking people or swearing at them or something.

Not to mention commands like top words that achieve the same thing but on a smaller scale.

How would FAF not be liable for some information issue here?

They:

  1. have made replay information public for everyone
  2. have made a parser to allow you to, uh, parse this information and even included instructions on how to use it

No idea about Europe but in the US there is a liability doctrine that doesn't let you just give a person tools, say "don't do that bad thing with the tools" and then wash your hands when you put zero effort into making it difficult to actually do said bad thing.

The only thing FAF hasn't done is give you step-by-step instructions on how to download replays from the vault to then use the tool.

I mean I don't get the issue in the first place, do people have legal ownership over the words they write in game or something? Wouldn't this already make the replay vault a "legal liability" unless you requested consent before publishing any replay?

Also, "You are not allowed to analyze single person behaviors or do a social rating and publish this (or basically do anything that relates back to a single user)" isn't this essentially what moderation does? Don't report results get reported back to the person that made the report? That's a publication of the analysis of a singular person's behavior.

I didn't mean to cause any legal trouble here; I was just interested in some data analysis. I should probably have started with things other than text chat and gauged the response, but text chat was the easiest for me to parse and make sense of.

From my perspective the information is already publicly and readily available in the replay vault.

I do understand that openly available data can become sensitive when aggregated.

Perhaps we need a disclaimer that replays and all the information they contain are publicly available, along with name history, rating history, etc.? It has always been very obvious to me that they are, but to others it may not be.

Legality aside, there are clear morality concerns. A LOT of miscellaneous personal information is public on the internet if you try hard enough to search for it, but collecting and publishing it on public forums is not really ok. The argument that "it's already public" only stands up if you delve into technicalities. And if a certain moderator were still active, giving them this kind of idea would likely result in a mass mega ban or a big drama fest.

Everything is open assuming good will:
a) don't misuse the data
b) don't cause performance issues on the server

As long as everybody behaves we're good. If I see misuse I'll shut it down / make it unavailable to the public.
So far no lines were crossed, but I hope I made clear where the red lines are.

So @Nooby you did not cause any trouble yet. I just tried to proactively step in before things go in the wrong direction.

APOLOGIES, THIS POST IS IN CODE FORMAT - it was the only way I could show my post while keeping the tab alignment of the word & count tables. 🙂

Currently learning Python and decided to play with and analyse this dataset just out of curiosity.

It contains 20,405,216 words spread across 23,929 files (representing that number of games), for a total data size of over 112 MB.

I removed the 2,760,185 non-English words, as I am only able to speak one language. So that's all the Russian, German, etc. words removed.

So there are 17,645,031 English words remaining, let's look at these.
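
The post does not say how the non-English words were identified. One possible approach, sketched below as an assumption only, is to keep tokens that appear in an English vocabulary set and drop the rest. The `split_by_vocabulary` helper and the tiny `vocabulary` are hypothetical; a real run would need a proper word list plus game slang such as "gg" or "t3".

```python
import re


def split_by_vocabulary(text, vocabulary):
    """Return (kept, dropped) token lists; a token is kept if it is in the vocabulary."""
    # Lowercase, then split into runs of letters and digits (Unicode-aware).
    tokens = re.findall(r"[^\W_]+", text.lower())
    kept = [t for t in tokens if t in vocabulary]
    dropped = [t for t in tokens if t not in vocabulary]
    return kept, dropped


# Hypothetical vocabulary; slang has to be added by hand or it gets dropped.
vocabulary = {"gg", "wp", "t3", "mass", "energy"}

kept, dropped = split_by_vocabulary("GG wp спасибо danke t3", vocabulary)
print(len(kept), "kept,", len(dropped), "dropped")
```

The totals are consistent with a split of this kind: 20,405,216 - 2,760,185 = 17,645,031.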

None of the following proves anything, I just thought it would be interesting to have a look.

What are the actual most commonly used words?


WORD		COUNT
----------	-------
to		1464477
sent		1305978
mass		743675
energy		667411
you		274816
me		263903
i		245113
give		223129
gg		179505
can		151937
the		131350

Nothing too surprising there. Let's look at some other word counts now.
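
For anyone who wants to reproduce counts like these, here is a minimal sketch using `collections.Counter`. It assumes the chat has already been loaded as a list of text lines; the sample lines are made up and the `count_words` helper is hypothetical, not the script used for this post.

```python
from collections import Counter


def count_words(lines):
    """Count lowercase, whitespace-separated words across all chat lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


# Made-up sample lines standing in for the real chat dump.
chat_lines = [
    "Sent 500 mass to ally",
    "give me energy",
    "sent energy to you",
    "gg",
]

counts = count_words(chat_lines)

# Print the most frequent words in the same tab-separated layout as the tables.
print("WORD\t\tCOUNT")
for word, count in counts.most_common(3):
    print(f"{word}\t\t{count}")
```

Only words are counted here, not who said them, so the result stays an aggregate over the whole dataset.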

Other words of note very commonly used:

WORD		COUNT
----------	-------
air		92538
units		83868
lol		66198
unit		60286
t3		51683
need		49956
dont		48234
why		30133
help		25714



How friendly are the games?

WORD		COUNT
----------	-------
pls		63084
gl		41712
hf		39946
plz		26900
ty		26191
nice		26087
please		26005
glhf		19296
thx		16672
sorry		15812
thanks		8638
sry		5498



How toxic are the games? Actually, not as toxic as I had feared.

WORD		COUNT
----------	-------
fuck		21997
shit		19482
fucking		16963
frustrating	16911
fucked		6089
ffs		5988
damn		5585
idiot		5494
ass		3068
asshole		800



What about issues in the game?

WORD		COUNT
----------	-------
lag		13433
re		19892
kick		10327
afk		5564
lagging		5035
eject		4792
lags		4251



How often are the game enders mentioned?

WORD		COUNT
----------	-------
nuke		28959
mavor		13966
para		6724
paragon		3185
scathis		4679
yolo		4106
novax		1669
yolona		1296
salvation	1047



And the experimental units?

WORD		COUNT
----------	-------
spider		6160
monkey		4272
gc		4031
chicken		2145
fatboy		2340
czar		1874
mega		1650
fatty		1335
tempest		1188
ahwassa		1083
monkeylord	501
megalith	473
ythotha		449
ripper		384
atlantis	379
asswasher	348
colossus	312
soulripper	55


Which factions get talked about most? Cybran and UEF lead, presumably because players ask for engineers to make Hives and Kennels:

WORD		COUNT
----------	-------
cybran		13614
uef		13065
aeon		8793
sera		5768
seraphim	1088 (much faster to just type sera!)



How are the commanders referred to?

WORD		COUNT
----------	-------
com		11785
acu		11555


Does playing FAF give you headaches? Because ibuprofen is mentioned 187 times.

@scout_more_often Don't apologize! This looks amazing, dude! You could even make a graphic out of this data! An "Interesting FAF Chat Facts" sheet!

FAF Website Developer

Would make an interesting fun-fact news item. The ibuprofen thing is funny.

How does one find their user-id?