Tags: python, pandas, csv, concatenation, large-files

Python Pandas: read_csv with chunksize and concat still throws MemoryError


I am trying to extract certain rows from a 10GB (~35 million row) CSV file into a new CSV based on a condition on one column (Geography = Ontario). It runs for a few minutes, I can see my free hard drive space getting drained from 14GB to basically zero, and then I get a MemoryError. I thought chunksize would help here, but it did not :( Please advise.

import pandas as pd
df = pd.read_csv("Data.csv", chunksize=10000)
result = pd.concat(df)
output = result[result['Geography'] == 'Ontario']
rowcount = len(output)
print(output)
print(rowcount)
output.to_csv('data2.csv')

Solution

  • You can try filtering and writing in chunks instead: pd.concat(df) stitches all of the chunks back into one full DataFrame in memory, which defeats the purpose of chunksize. Roughly (a variant that also counts the matching rows is sketched in the next bullet):

    # the question writes its output to 'data2.csv'; mode='a' appends,
    # so remove any existing output file before re-running
    outfilename = 'data2.csv'
    df = pd.read_csv("Data.csv", chunksize=10000)
    header = True
    for chunk in df:
        # keep only the Ontario rows; write the header only for the first chunk
        chunk = chunk[chunk['Geography'] == 'Ontario']
        chunk.to_csv(outfilename, header=header, mode='a')
        header = False
    

    Idea from here.
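
  • For completeness, here is a sketch of the same chunked approach that also keeps a running count of the matching rows (the question prints rowcount); it assumes the same file and column names as above:

    import pandas as pd

    outfilename = 'data2.csv'   # same output file as in the question
    rowcount = 0                # running total of Ontario rows
    header = True
    for chunk in pd.read_csv("Data.csv", chunksize=10000):
        # filter each chunk, count the matches, and append them to the output
        ontario = chunk[chunk['Geography'] == 'Ontario']
        rowcount += len(ontario)
        ontario.to_csv(outfilename, header=header, mode='a')
        header = False
    print(rowcount)

    This way only one chunk (at most 10,000 rows) is held in memory at a time, instead of the full 10GB file.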