I have a very large csv file (+50k lines).
This file contains IRC logs and here's the data format:
Here's an example of the data:
1,1382445487956,"bob","i don't know how to do such a task"
1,1382025765196,"alice","bro ask stackoverflow"
1,1382454875476,"_XxCoder_killerxX_","I'm pretty sure it can be done with python, bob"
2,1380631520410,"helloman","helloman_ join the chan."
For example, _XxCoder_killerxX_
mentioned bob
.
So, knowing all of this, I want to know which pair of usernames mentioned each others the most.
I want messages to be count, so I only need to work on lines starting with the number "1" (as there is a bunch of lines starting with "2" and other irrelevant numbers)
I know it can be done with the csv
Python module, but I've never worked with such larges files so I really don't know how to start all of this.
You should perform two passes of the CSV: one to capture all sender usernames, the second to find sender usernames mentioned in messages.
import csv
users = set()
with open("test.csv", "r") as file:
reader = csv.reader(file)
for line in reader:
users.add(line[2])
mentions = {}
with open("test.csv", "r") as file:
reader = csv.reader(file)
for line in reader:
sender, message = line[2], line[3]
for recipient in users:
if recipient == sender:
continue # can't mention yourself
if recipient in message:
key = (sender, recipient)
mentions[key] = mentions.get(key, 0) + 1
for mention, times in mentions.items():
print(f"{mention[0]} mentioned {mention[1]} {times} time(s)")
totals = {}
for mention, times in mentions.items():
key = tuple(sorted(mention))
totals[key] = totals.get(key, 0) + times
for names, times in totals.items():
print(f"{names[0]} and {names[1]} mentioned each other {times} time(s)")
This example is naive, as it's performing simple substring matches. So, if there's someone named "foo" and someone mentions "food" in a message, it will indicate a match.