I recently downloaded the stackoverflow.com-Posts.7z file from the Stack Exchange Data Dump. Upon extracting the .7z file, I was left with a Posts.xml file, which I converted to a Posts.csv file using the "stackexchange-xml-converter" tool on GitHub. The Posts.csv file contains every post ever made on the entire Stack Overflow website. The total size of the Posts.csv file is ~67 GB, so it is too large to open in Microsoft Excel, Visual Studio Code, Notepad, etc.
Each row in that CSV file (apart from the first row, which is the header row) holds all the data associated with one particular post. For example, here are just some of the data categories associated with each post: Title, Tags, ContentLicense, ViewCount, CommentCount, CreationDate, etc. Each data category is its own column in the CSV file.
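Since the file is too big to open and inspect directly, one way to double-check the column names is to read just the header row (a minimal sketch):

import csv

# read only the first (header) row, so the 67 GB file is never loaded in full
with open('Posts.csv', newline='', encoding='utf-8') as f:
    header = next(csv.reader(f))

print(header)  # e.g. [..., 'Title', 'Tags', 'ContentLicense', 'ViewCount', ...]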
My issue is that, given a list of tags of interest, I am trying to count how many times each of those tags appears in the Posts.csv file, using Python. For example, suppose I have the following list in Python:
tagsOfInterest = ['version-control', 'git', 'git-merge', 'bash', 'microservices']
Looking only at the Tags column of the CSV file, I would like to count how many times the tag version-control appears, how many times the tag git appears, how many times the tag git-merge appears, and so on.
I've been struggling to do this because each row in the Tags column is formatted as one continuous string, with every tag wrapped in angle brackets rather than separated by a delimiter like a comma. For instance, in the first row, a post has been tagged with <version-control><projects-and-solutions><monorepo>.
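In isolation, splitting one of those strings into its individual tags doesn't seem too bad, e.g.:

raw = '<version-control><projects-and-solutions><monorepo>'
# drop the outer angle brackets, then split on the "><" between adjacent tags
tags = raw.lstrip('<').rstrip('>').split('><')
print(tags)  # ['version-control', 'projects-and-solutions', 'monorepo']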
My original attempt involved first reading the Posts.csv file and then adding each row of the Tags column to a list, like so:
import pandas as pd

# Read the data
data = pd.read_csv("Posts.csv")
# Add each row of the "Tags" column to a list
tags_col = data['Tags'].tolist()
and then my idea was to just split each of those strings into its individual tags. However, the Posts.csv file is so large that my computer runs out of memory just from creating the list!
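I suspect that reading only the Tags column in chunks (pandas' read_csv accepts usecols and chunksize) would keep the whole 67 GB file out of memory, but I haven't managed to put the counting part together. Roughly:

import pandas as pd

# read only the Tags column, 100k rows at a time,
# so the whole file never has to sit in memory at once
for chunk in pd.read_csv("Posts.csv", usecols=["Tags"], chunksize=100_000):
    for raw in chunk["Tags"].dropna():
        pass  # this is where each Tags string would need to be split and counted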
As such, my question is: given a list of tags of interest, e.g. tagsOfInterest = ['version-control', 'git', 'git-merge', 'bash', 'microservices'], how can I count the frequency of each element of that list in the Tags column of the Posts.csv file?
import csv
from collections import Counter

counts = Counter()
with open('Posts.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        # row[1] assumes Tags is the second column; adjust the index if needed
        if not row[1]:  # skip posts with no tags
            continue
        for tag in row[1].lstrip('<').rstrip('>').split('><'):
            counts[tag] += 1
print(counts)
You can use a DictReader if you want, so you can write row['Tags'] instead of row[1].
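For example, here is a sketch of the DictReader version that also narrows the output down to just your tagsOfInterest list (assuming the converted file kept a Tags column header and leaves it empty for untagged posts):

import csv
from collections import Counter

tagsOfInterest = ['version-control', 'git', 'git-merge', 'bash', 'microservices']

counts = Counter()
with open('Posts.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        tags = row['Tags']
        if not tags:  # untagged posts
            continue
        counts.update(tags.lstrip('<').rstrip('>').split('><'))

# keep only the counts for the tags you actually care about
print({tag: counts[tag] for tag in tagsOfInterest})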