Tags: python, pandas, ram, chunks

Reading and filtering data with pandas chunks: efficient filtering on chunks with multiple criteria while avoiding a MemoryError?


I have a large CSV file that I want to read into a pandas DataFrame and then perform operations on. It has to be done in pandas; dask is unfortunately not an option. Therefore I process the CSV in chunks. This works, for example, with a chunk size of 5000000:

import pandas as pd
import csv
import os

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}

output_path = r'C:\myfolder\test.csv'

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader: 
        chunk.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                    encoding="utf-8",
                    index=False,
                    sep=";",
                    decimal=",",
                    date_format="%d.%m.%Y",
                    quoting=csv.QUOTE_MINIMAL)

It takes a while, but works.

Now, I want to select a subset and export this subset to a new csv file. Suppose my original file looks like this:

Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BD;4L432;B;6A60
CD;5CDES;C;6A40
BF;5CLES;B;6A10

Now I want to select those records where the first two characters in Column2 are "5C" and Column3 is equal to "B". The output should be:

Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BF;5CLES;B;6A10

Starting with one criterion, I first tried:

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader: 
        filtered = chunk[chunk['Column3']=='B']
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                    encoding="utf-8",
                    index=False,
                    sep=";",
                    decimal=",",
                    date_format="%d.%m.%Y",
                    quoting=csv.QUOTE_MINIMAL)

and it worked (with the same chunk size of 5000000).

However, I now have several criteria, not just one. To implement the second criterion I tried:

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader: 
        filtered = chunk[(chunk['Column3']=='B') & (chunk['Column2'].str[:2]=="5C")]
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                    encoding="utf-8",
                    index=False,
                    sep=";",
                    decimal=",",
                    date_format="%d.%m.%Y",
                    quoting=csv.QUOTE_MINIMAL)

And now comes the problem: it throws a memory error (MemoryError: Unable to allocate...). The error does not occur with the first chunk, but later on. It seems this needs far more RAM than having just one criterion, and I was a bit confused why the first chunks complete without any problems while a MemoryError is thrown on later chunks, towards the end of the file. It looks like those chunks contain more rows that match the filter criteria, so more data is stored and operated on (one way to check this is sketched below).
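
To verify that hypothesis, a minimal diagnostic sketch, separate from the export loop above, that prints each chunk's row count and memory footprint (DataFrame.memory_usage with deep=True is standard pandas; the chunk size matches my setup above):

import pandas as pd

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for i, chunk in enumerate(reader):
        # deep=True counts the actual string payloads, not just the pointers
        mb = chunk.memory_usage(deep=True).sum() / 1e6
        print(f"chunk {i}: {len(chunk)} rows, ~{mb:.0f} MB")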

I therefore reduced the chunk size to 2000000 and got the same memory error; with 1000000 it still failed, and only with 500000 did it work. Now I am not sure whether it is "natural" that I have to go down with the chunk size because more criteria need more computational resources, or whether my way of combining the criteria, i.e. the line filtered = chunk[(chunk['Column3']=='B') & (chunk['Column2'].str[:2]=="5C")], is inefficient. So is there a more efficient way to subselect the data and write it out? In this example I used 2 criteria, but I will have more, up to 6, which I would keep combining with and/or operators in the filtered line. One variant I am considering is sketched below.
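
A leaner variant of the same loop, assuming str.startswith is an acceptable substitute for the slice comparison (it returns the boolean mask directly instead of first materializing a Series of sliced strings); whether that alone avoids the MemoryError at larger chunk sizes will depend on the machine:

import csv
import os

import pandas as pd

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
input_path = r'C:\myfolder\large_file.csv'
output_path = r'C:\myfolder\test.csv'

# If the real file had columns I never use, usecols=list(dtypes) would keep
# them out of memory entirely; here all four columns are needed.
with pd.read_csv(input_path, sep=";", encoding="utf-8", dtype=dtypes,
                 decimal=",", chunksize=500000) as reader:
    for chunk in reader:
        # startswith yields the boolean mask without building an
        # intermediate Series of two-character strings
        mask = (chunk['Column3'] == 'B') & chunk['Column2'].str.startswith("5C")
        chunk[mask].to_csv(output_path, mode='a',
                           header=not os.path.exists(output_path),
                           encoding="utf-8", index=False, sep=";",
                           decimal=",", quoting=csv.QUOTE_MINIMAL)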


Solution

  • I think you'll be happier doing this streaming preprocessing without Pandas at all, with just the built-in csv module.

    import csv
    
    output_path = r"C:\myfolder\test.csv"
    input_path = r"C:\myfolder\large_file.csv"
    
    with open(output_path, "w", encoding="utf-8", newline="") as out_f:
        cw = csv.writer(out_f, delimiter=";", quoting=csv.QUOTE_MINIMAL)
        with open(input_path, encoding="utf-8", newline="") as in_f:
            cr = csv.reader(in_f, delimiter=";")
            cw.writerow(next(cr))  # copy the header row
            for row in cr:  # `row` is a list of strings
                if row[2] == "B" and row[1][:2] == "5C":
                    cw.writerow(row)
    

    You can then read the filtered output_path CSV file back in with Pandas if you need to, as sketched below.
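
    For example, a minimal read-back sketch that reuses the separator, decimal, and string-dtype settings from the question (adjust to taste):

    import pandas as pd

    # the filtered file is much smaller, so it should now fit in memory
    df = pd.read_csv(r"C:\myfolder\test.csv", sep=";", encoding="utf-8",
                     dtype=str, decimal=",")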