
How can I know the number of chunks (or number of rows) in a CSV file before chunking?


I'm trying to process a 10 GB+ CSV file with pandas using a chunksize of 5,000,000. Processing each chunk takes some time, but I don't know which part of the file I'm in. How can I find out, before starting the loop, how many chunks (or rows) the CSV has? And is there a rule of thumb for choosing a chunksize?

The loop I'm using is something like this:

import pandas as pd

chunksize = 5_000_000
data = pd.read_csv(path, chunksize=chunksize)

for chunk in data:
    do_some_stuff(chunk)

Solution

  • Some ideas

    • estimate the line count before/without reading the whole file (an exact-count alternative is sketched in a later bullet)
    import os

    full_size = os.path.getsize('test.csv')  # size of the file in bytes
    with open('test.csv', 'rb') as f:
        next(f)                              # skip header
        line_size = len(f.readline())        # size in bytes of one data line
    linecount = full_size // line_size + 1   # ~count of lines, assuming roughly uniform line lengths
    
    • estimate the available memory for the dataframe chunks (a multi-row sampling variant appears in a later bullet)
    import psutil
    import pandas as pd

    chunksize = (
        psutil.virtual_memory().available
        // (pd.read_csv('test.csv', nrows=1).memory_usage(deep=True).sum()
            * 4))  # leave memory for computations, here: 1/4 of available memory for the dataframe
    
    • use the information in your loop (a tqdm-based variant is sketched in the last bullet)
    chunks = linecount // chunksize + 1                  # ~number of chunks
    c = pd.read_csv('test.csv', chunksize=chunksize)
    for i, df in enumerate(c):
        print(f'working on chunk {i+1} of {chunks}')     # print or log
        df_sum = df.sum().sum()                          # do stuff ...
        print(f'sum in df in chunk {i+1}: {df_sum}')     # print or log
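
    • exact line count (a minimal sketch, not the only way): if the size-based estimate above isn't precise enough, you can count lines exactly without loading the file into pandas by streaming it in binary blocks and counting newline bytes. The count_lines helper name and the 1 MiB block size are illustrative choices; this assumes every row, including the last, ends with a newline.
    def count_lines(path, block_size=2**20):
        count = 0
        with open(path, 'rb') as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                count += block.count(b'\n')   # one newline per line ending
        return count

    linecount = count_lines('test.csv') - 1   # subtract 1 for the header row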
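
    • sampling more rows for the memory estimate: a single row can skew the per-row figure (index overhead, an atypically short or long first row). Below is a sketch that averages over the first 1,000 rows instead; the sample size is an arbitrary choice.
    import psutil
    import pandas as pd

    sample = pd.read_csv('test.csv', nrows=1000)                # sample rows for the estimate
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
    chunksize = int(psutil.virtual_memory().available // (bytes_per_row * 4))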
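
    • progress reporting with tqdm: if a third-party dependency is acceptable, the same chunk count can drive a progress bar instead of hand-rolled prints. A sketch assuming the tqdm package is installed, reusing linecount and chunksize from above:
    from tqdm import tqdm
    import pandas as pd

    chunks = linecount // chunksize + 1               # ~number of chunks, as above
    reader = pd.read_csv('test.csv', chunksize=chunksize)
    for df in tqdm(reader, total=chunks):             # renders "i/chunks" progress
        df_sum = df.sum().sum()                       # do stuff ...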