Search code examples
pythoncsvlarge-files

Python parsing large CSV file for usernames


I have a very large csv file (+50k lines).

This file contains IRC logs and here's the data format:

  • 1st column: Message type (1 for message, 2 for system)
  • 2nd column: Timestamps (numbers of seconds since a precise date)
  • 3rd column: Username of the one writing the message
  • 4th column: Message

Here's an example of the data:

1,1382445487956,"bob","i don't know how to do such a task"
1,1382025765196,"alice","bro ask stackoverflow"
1,1382454875476,"_XxCoder_killerxX_","I'm pretty sure it can be done with python, bob"
2,1380631520410,"helloman","helloman_ join the chan."

For example, _XxCoder_killerxX_ mentioned bob.

So, knowing all of this, I want to know which pair of usernames mentioned each others the most.

I want messages to be count, so I only need to work on lines starting with the number "1" (as there is a bunch of lines starting with "2" and other irrelevant numbers)

I know it can be done with the csv Python module, but I've never worked with such larges files so I really don't know how to start all of this.


Solution

  • You should perform two passes of the CSV: one to capture all sender usernames, the second to find sender usernames mentioned in messages.

    import csv
    
    users = set()
    
    with open("test.csv", "r") as file:
        reader = csv.reader(file)
        for line in reader:
            users.add(line[2])
    
    mentions = {}
    
    with open("test.csv", "r") as file:
        reader = csv.reader(file)
        for line in reader:
            sender, message = line[2], line[3]
            for recipient in users:
                if recipient == sender:
                    continue  # can't mention yourself
                if recipient in message:
                    key = (sender, recipient)
                    mentions[key] = mentions.get(key, 0) + 1
    
    for mention, times in mentions.items():
        print(f"{mention[0]} mentioned {mention[1]} {times} time(s)")
    
    
    totals = {}
    
    for mention, times in mentions.items():
        key = tuple(sorted(mention))
        totals[key] = totals.get(key, 0) + times
    
    for names, times in totals.items():
        print(f"{names[0]} and {names[1]} mentioned each other {times} time(s)")
    

    This example is naive, as it's performing simple substring matches. So, if there's someone named "foo" and someone mentions "food" in a message, it will indicate a match.