Context: I have a file with ~44 million rows. Each row is an individual with a US address, so there's a "ZIP Code" field. The file is plain text, pipe-delimited.
Due to the size, I can't (at least on my machine) use Pandas to analyze it. So a basic question I have is: how many records (rows) are there for each distinct ZIP code? I took the following steps, but I wonder if there's a faster, more Pythonic way to do this (it seems like there is, I just don't know it).
Step 1: Created a set of ZIP values from the file:
output = set()
with open(filename) as f:
    for line in f:
        output.add(line.split('|')[8])  # 9th item in the split string is the "ZIP" value
zip_list = list(output)  # list has length 45,292
Step 2: Created a "0" list, same length as first list:
zero_zip = [0]*len(zip_list)
Step 3: Created a dictionary (with all zeroes) from those two lists:
zip_dict = dict(zip(zip_list, zero_zip))
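(For reference, Steps 2 and 3 could probably be collapsed into a single dict.fromkeys call, which builds the same zero-initialized dictionary directly:

zip_dict = dict.fromkeys(zip_list, 0)  # every distinct ZIP starts at 0
)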
Step 4: Lastly I ran through the file again, this time updating the dict I just created:
with open(filename) as f:
    next(f)  # skip first line, which contains headers
    for line in f:
        zip_dict[line.split('|')[8]] += 1
I got the end result but wondering if there's a simpler way. Thanks all.
Creating the zip_dict can be replaced with a defaultdict. And since you're already running through every line in the file, you don't need to do it twice: just keep a running count as you go.
from collections import defaultdict

d = defaultdict(int)  # unseen ZIP codes start at 0 automatically
with open(filename) as f:
    next(f)  # skip the header line
    for line in f:
        parts = line.split('|')
        d[parts[8]] += 1
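If you'd rather not track the counts yourself at all, collections.Counter does the same bookkeeping for you; a minimal sketch:

from collections import Counter

counts = Counter()
with open(filename) as f:
    next(f)  # skip the header line
    for line in f:
        counts[line.split('|')[8]] += 1

print(counts.most_common(10))  # ten most frequent ZIP codes and their counts

Counter is a dict subclass, so the result can be used exactly like zip_dict afterwards.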