
How to count the frequency of specific Tags from the Stack Overflow data dump's CSV file in Python


I recently downloaded the stackoverflow.com-Posts.7z file from the Stack Exchange Data Dump. Upon extracting the .7z file, I was left with a Posts.xml file, which I converted to a Posts.csv file using the "stackexchange-xml-converter" tool on GitHub. The Posts.csv file contains all the posts that have been posted on the entire Stack Overflow website. The total size of the Posts.csv file is ~67 GB, so it is too large to open in Microsoft Excel, Visual Studio Code, Notepad, etc.

Each row in that CSV file (apart from the first row, which is the header row) corresponds to all the data associated with one particular post. For example, here are just some of the data categories associated with each post: Title, Tags, ContentLicense, ViewCount, CommentCount, CreationDate, etc. Each data category is its own column in the CSV file. Here's an image of what it looks like:

Structure of the Posts.csv file from the Stack Exchange Data Dump

My issue is that, given a list of tags of interest, I am trying to count in Python how often each of those tags appears in the Posts.csv file. For example, suppose I have the following list in Python:

tagsOfInterest = ['version-control', 'git', 'git-merge', 'bash', 'microservices']

In the Tags column of the CSV file, I would like to count how many times the tag version-control appears, how many times the tag git appears, how many times the tag git-merge appears, and so on.

I've been struggling to do this because each row in the Tags column is formatted as a single continuous string, with every tag wrapped in angle brackets. For instance, in the first row, a post has been tagged with <version-control><projects-and-solutions><monorepo>.
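That format can be split into individual tags by stripping the outer brackets and splitting on the `><` between adjacent tags, e.g.:

```python
tags = '<version-control><projects-and-solutions><monorepo>'

# Strip the leading '<' and trailing '>', then split on the '><'
# that separates adjacent tags.
tag_list = tags.strip('<>').split('><')
print(tag_list)  # ['version-control', 'projects-and-solutions', 'monorepo']
```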

My original attempt involved first reading the Posts.csv file, and then adding each row in the Tags column to a list, as such:

import pandas as pd

# Read data
data = pd.read_csv("Posts.csv")

# Add each row in the "Tags" column to a list:
tags_col = data['Tags'].tolist()

and then my idea was to just tokenize each tag word. However, the Posts.csv file is so large, that my computer runs out of memory just from creating the list!
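If you want to stay in pandas, one memory-friendly sketch is to read only the Tags column in chunks rather than the whole file at once. This assumes the column is literally named `Tags` in the converted CSV; `count_tags` is a hypothetical helper name:

```python
import pandas as pd
from collections import Counter

def count_tags(path, wanted, chunksize=100_000):
    """Count occurrences of the tags in `wanted`, reading only the
    Tags column of the CSV `chunksize` rows at a time."""
    counts = Counter()
    for chunk in pd.read_csv(path, usecols=['Tags'], chunksize=chunksize):
        # Posts without tags come through as NaN; skip them.
        for tags in chunk['Tags'].dropna():
            for tag in tags.strip('<>').split('><'):
                if tag in wanted:
                    counts[tag] += 1
    return counts

# counts = count_tags('Posts.csv',
#                     {'version-control', 'git', 'git-merge', 'bash', 'microservices'})
```

Because only one chunk of one column is in memory at a time, peak memory stays roughly constant regardless of the file size.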

As such, my question is: given a list of tags of interest, e.g., tagsOfInterest = ['version-control', 'git', 'git-merge', 'bash', 'microservices'], how can I count how many times each element of that list appears in the Tags column of the Posts.csv file?


Solution

  • import csv
    from collections import Counter
    
    # Tags you want to count; anything else (including the header row's
    # literal "Tags" value and posts with an empty Tags field) is skipped.
    wanted = {'version-control', 'git', 'git-merge', 'bash', 'microservices'}
    csv.field_size_limit(10_000_000)  # some Body fields exceed the default limit
    
    counts = Counter()
    with open('Posts.csv', newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            for tag in row[1].strip('<>').split('><'):
                if tag in wanted:
                    counts[tag] += 1
    print(counts)
    

If you'd rather refer to the column by name instead of by position, use a csv.DictReader and row['Tags'] instead of row[1].
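A minimal sketch of that DictReader variant, run here on an inline sample (the real file's columns may differ from this two-column example):

```python
import csv
import io
from collections import Counter

# Illustrative sample standing in for Posts.csv; row 3 has no tags.
sample = "Id,Tags\n1,<git><bash>\n2,<git>\n3,\n"

counts = Counter()
for row in csv.DictReader(io.StringIO(sample)):
    tags = row['Tags']
    if tags:  # skip posts with an empty Tags field
        for tag in tags.strip('<>').split('><'):
            counts[tag] += 1
print(counts)  # Counter({'git': 2, 'bash': 1})
```

DictReader consumes the header row itself, so there is no need to skip it manually.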