Tags: python, python-3.x, dataset, analysis

A simpler way to create a dictionary with counts from a 43 million row text file?


Context: I have a file with ~44 million rows. Each row is an individual with a US address, so there's a "ZIP Code" field. The file is plain text, pipe-delimited.

Due to its size, I cannot (at least on my machine) use Pandas to analyze it. So the basic question I have is: how many records (rows) are there for each distinct ZIP code? I took the following steps, but I wonder if there's a faster, more Pythonic way to do this (it seems like there is, I just don't know it).

Step 1: Created a set of ZIP values from the file:

output = set()

with open(filename) as f:
    next(f)  # skip the first line, which contains headers
    for line in f:
        output.add(line.split('|')[8])  # 9th field (index 8) is the "ZIP" value

zip_list = list(output)  # list has length 45,292

Step 2: Created a list of zeros, the same length as the first list:

zero_zip = [0]*len(zip_list)

Step 3: Created a dictionary (with all zeroes) from those two lists:

zip_dict = dict(zip(zip_list, zero_zip))  

Step 4: Lastly, I ran through the file again, this time updating the dict I just created:

with open(filename) as f:
    next(f)  # skip first line, which contains headers
    for line in f:
        zip_dict[line.split('|')[8]] += 1

I got the end result, but I'm wondering if there's a simpler way. Thanks, all.


Solution

  • Building zip_dict by hand can be replaced with a collections.defaultdict. And since you have to run through every line of the file anyway, there's no need for two passes: you can keep a running count in a single pass.

    from collections import defaultdict

    d = defaultdict(int)  # missing keys start at a count of 0

    with open(filename) as f:
        next(f)  # skip the header row
        for line in f:
            parts = line.split('|')
            d[parts[8]] += 1  # 9th field (index 8) is the ZIP value
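
  • For what it's worth, the standard library's collections.Counter can express the same single pass even more compactly, and it gives you most_common() for free. A minimal sketch, assuming the same pipe-delimited layout with the ZIP value in the 9th field:

    from collections import Counter

    with open(filename) as f:
        next(f)  # skip the header row
        # Counter consumes the generator and tallies each ZIP as it streams by
        counts = Counter(line.split('|')[8] for line in f)

    # e.g. the ten most frequent ZIP codes with their record counts
    print(counts.most_common(10))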