Python parsing large CSV file for usernames

I have a very large csv file (+50k lines).

This file contains IRC logs and here's the data format:

1st column: Message type (1 for message, 2 for system)
2nd column: Timestamps (numbers of seconds since a precise date)
3rd column: Username of the one writing the message
4th column: Message

Here's an example of the data:

1,1382445487956,"bob","i don't know how to do such a task"
1,1382025765196,"alice","bro ask stackoverflow"
1,1382454875476,"_XxCoder_killerxX_","I'm pretty sure it can be done with python, bob"
2,1380631520410,"helloman","helloman_ join the chan."

For example, _XxCoder_killerxX_ mentioned bob.

So, knowing all of this, I want to know which pair of usernames mentioned each others the most.

I want messages to be count, so I only need to work on lines starting with the number "1" (as there is a bunch of lines starting with "2" and other irrelevant numbers)

I know it can be done with the csv Python module, but I've never worked with such larges files so I really don't know how to start all of this.

Solution

You should perform two passes of the CSV: one to capture all sender usernames, the second to find sender usernames mentioned in messages.

import csv

users = set()

with open("test.csv", "r") as file:
    reader = csv.reader(file)
    for line in reader:
        users.add(line[2])

mentions = {}

with open("test.csv", "r") as file:
    reader = csv.reader(file)
    for line in reader:
        sender, message = line[2], line[3]
        for recipient in users:
            if recipient == sender:
                continue  # can't mention yourself
            if recipient in message:
                key = (sender, recipient)
                mentions[key] = mentions.get(key, 0) + 1

for mention, times in mentions.items():
    print(f"{mention[0]} mentioned {mention[1]} {times} time(s)")


totals = {}

for mention, times in mentions.items():
    key = tuple(sorted(mention))
    totals[key] = totals.get(key, 0) + times

for names, times in totals.items():
    print(f"{names[0]} and {names[1]} mentioned each other {times} time(s)")

This example is naive, as it's performing simple substring matches. So, if there's someone named "foo" and someone mentions "food" in a message, it will indicate a match.