Search code examples
pythonpandaschunks

Load pandas dataframe with chunksize determined by column variable


If I have a csv file that's too large to load into memory with pandas (in this case 35gb), I know it's possible to process the file in chunks, with chunksize.

However I want to know if it's possible to change chunksize based on values in a column.

I have an ID column, and then several rows for each ID with information, like this:

ID,   Time,  x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
ect...

I don't want to separate IDs into different chunks. for example chunks of size 4 would be processed:

ID,   Time,  x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3 <--this extra line is included in the 4 chunk

ID,   Time,  x, y
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
...

Is it possible?

If not perhaps using the csv library with a for loop along the lines of:

for line in file:
    x += 1
    if x > 1000000 and curid != line[0]:
        break
    curid = line[0]
    #code to append line to a dataframe

although I know this would only create one chunk, and for loops take a long time to process.


Solution

  • If you iterate through the csv file line by line, you can yield chunks with a generator dependent on any column.

    Working example:

    import pandas as pd
    
    def iter_chunk_by_id(file):
        csv_reader = pd.read_csv(file, iterator=True, chunksize=1, header=None)
        first_chunk = csv_reader.get_chunk()
        id = first_chunk.iloc[0,0]
        chunk = pd.DataFrame(first_chunk)
        for l in csv_reader:
            if id == l.iloc[0,0]:
                id = l.iloc[0,0]
                chunk = chunk.append(l)
                continue
            id = l.iloc[0,0]
            yield chunk
            chunk = pd.DataFrame(l)
        yield chunk
    
    ## data.csv ##
    # 1, foo, bla
    # 1, off, aff
    # 2, roo, laa
    # 3, asd, fds
    # 3, qwe, tre
    # 3, tre, yxc   
    
    chunk_iter = iter_chunk_by_id("data.csv")
    
    for chunk in chunk_iter:
        print(chunk)
        print("_____")
    

    Output:

       0     1     2
    0  1   foo   bla
    1  1   off   aff
    _____
       0     1     2
    2  2   roo   laa
    3  2   jkl   xds
    _____
       0     1     2
    4  3   asd   fds
    5  3   qwe   tre
    6  3   tre   yxc
    _____