Search code examples
python-3.xcsviterationfrequency

Count Occurrences for Objects in a Column of Lists for Really Large CSV File


I have a huge CSV file (8gb) containing multiple columns. One of the columns are a column of lists that looks like this:

     YEAR              WIN_COUNTRY_ISO3
200  2017         ['BEL', 'FRA', 'ESP']
201  2017                ['BEL', 'LTU']
202  2017                ['POL', 'BEL']
203  2017                       ['BEL']
204  2017  ['GRC', 'DEU', 'FRA', 'LVA']
205  2017                       ['LUX']
206  2017         ['BEL', 'SWE', 'LUX']
207  2017                       ['BEL']
208  2017                            []
209  2017                            []
210  2017                            []
211  2017                       ['BEL']
212  2017                       ['SWE']
213  2017                ['LUX', 'LUX']
214  2018                ['DEU', 'LUX']
215  2018                ['ESP', 'PRT']
216  2018                       ['AUT']
217  2018                ['DEU', 'BEL']
218  2009                       ['ESP']
219  2009                       ['BGR']

Each of the 3-letter code represents a country. I would like to create a frequency table for each country so i can count the occurrences of each country in the entire column. Since the file is really large and my PC can't handle to load the whole CSV as dataframes, I try to read the file lazily and iterate through the line --> getting the last column and add the object in each row of the WIN_COUNTRY_ISO3 column (which happens to be the last column) to a set of dictionary.

import sys
from itertools import islice
n=100
i = 0
col_dict={}
with open(r"filepath.csv") as file:
    for nline in iter(lambda: tuple(islice(file, n)), ()):
        row = nline.splitline
        WIN_COUNTRY_ISO3 = row[-1]
        for iso3 in WIN_COUNTRY_ISO3:
            if iso3 in col_dict.keys():
                col_dict[iso3]+=1
            else:
                col_dict[iso3]=1
        i+=1
        sys.stdout.write("\rDoing thing %i" % i)
        sys.stdout.flush()
    print(col_dict)

However, this process takes a really long time. I tried through iterate through multiple lines by using the code

for nline in iter(lambda: tuple(islice(file, n)), ())

Q1:

However, this doesn't seem to work and python process the file one by one. Does anybody know the most any efficient way for me to generate the count of each country for a really large file like mine?

The resulting table would look like this:

Country     Freq
BEL         4543
FRA         4291
ESP         3992
LTU         3769
POL         3720
GRC         3213
DEU         3119
LVA         2992
LUX         2859
SWE         2802
PRT         2584
AUT         2374
BGR         1978
RUS         1770
TUR         1684

I would also like to create the frequency table by each year (in the YEAR column) if anybody can help me with this. Thank you.


Solution

  • Try this:

    from collections import defaultdict
    import csv
    import re
    
    result = defaultdict(int)
    f = open(r"filepath.csv")
    next(f)
    for row in f:
        data = re.sub(r'[\s\d\'\[\]]', '', row)
        if data:
            for x in data.split(','):
                result[x] += 1
    print(result)