I have a large CSV file and want to read it into a pandas dataframe and perform operations on it. I need to do it in pandas; dask is unfortunately not an option. Therefore I process the CSV in chunks. This works, for example, with a chunk size of 5000000:
import pandas as pd
import csv
import os
dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
output_path = r'C:\myfolder\test.csv'

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader:
        chunk.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                     encoding="utf-8",
                     index=False,
                     sep=";",
                     decimal=",",
                     date_format="%d.%m.%Y",
                     quoting=csv.QUOTE_MINIMAL)
It takes a while, but works.
Now I want to select a subset and export it to a new CSV file. Suppose my original file looks like this:
Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BD;4L432;B;6A60
CD;5CDES;C;6A40
BF;5CLES;B;6A10
Now I want to subselect those records where the first two characters in Column2 are "5C" and Column3 is equal to "B". The output should be:
Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BF;5CLES;B;6A10
Starting with one criterion, I first tried:
with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader:
        filtered = chunk[chunk['Column3']=='B']
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                        encoding="utf-8",
                        index=False,
                        sep=";",
                        decimal=",",
                        date_format="%d.%m.%Y",
                        quoting=csv.QUOTE_MINIMAL)
and it worked (with the same chunk size of 5000000).
However, I now have several criteria, not just one. To implement the second criterion I tried:
with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader:
        filtered = chunk[(chunk['Column3']=='B') & (chunk['Column2'].str[:2]=="5C")]
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                        encoding="utf-8",
                        index=False,
                        sep=";",
                        decimal=",",
                        date_format="%d.%m.%Y",
                        quoting=csv.QUOTE_MINIMAL)
And now comes the problem: it throws a memory error (MemoryError: Unable to allocate...). The error does not occur with the first chunk, but only later on. It seems that this needs far more RAM than filtering on just one criterion. I was a bit confused about why the first chunks are processed without any problems while a MemoryError is thrown for the later chunks (closer to the end of the file). It seems those chunks contain more rows that match the filter criteria, so more data is kept and operated on.
I therefore reduced the chunk size to 2000000 and got the same memory error. I tried 1000000 and still got a memory error; with 500000 it worked. Now I am not sure whether it is "natural" that I have to reduce the chunk size because more criteria require more resources, or whether my way of combining the criteria, i.e. this line:

filtered = chunk[(chunk['Column3']=='B') & (chunk['Column2'].str[:2]=="5C")]

is inefficient. So is there a more efficient way to subselect the data and write it out? In this example I used 2 criteria, but I will have more, up to 6, and I would keep combining them with & (and) and | (or) operators in that filter line, as sketched below.
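For illustration, with more criteria the filter line would grow roughly like this (Column5 and the value "X" are just placeholders for a further real condition):

mask = (
    (chunk['Column3']=='B')
    & (chunk['Column2'].str[:2]=="5C")
    & (chunk['Column5']=='X')   # placeholder for an additional criterion
)
filtered = chunk[mask]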
I think you'll be happier doing this streaming preprocessing without Pandas at all, with just the built-in csv module.
import csv

output_path = r"C:\myfolder\test.csv"
input_path = r"C:\myfolder\large_file.csv"

# newline="" keeps csv.writer from inserting extra blank lines on Windows
with open(output_path, "w", encoding="utf-8", newline="") as out_f:
    cw = csv.writer(out_f, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    with open(input_path, encoding="utf-8", newline="") as in_f:
        cr = csv.reader(in_f, delimiter=";")
        cw.writerow(next(cr))  # copy header
        for line in cr:  # `line` is a list of strings
            if line[2] == "B" and line[1][:2] == "5C":
                cw.writerow(line)
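If you end up needing all six criteria, one option (just a sketch; the conditions below are placeholders for your real ones) is to pull the checks into a small predicate function and call it from the same loop, so the loop body stays readable:

def keep(row):
    # swap these placeholder conditions for your real six criteria
    return row[2] == "B" and row[1][:2] == "5C"

# inside the loop over cr above:
for line in cr:
    if keep(line):
        cw.writerow(line)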
You can then read the now-subset output_path CSV file with Pandas if you need to.
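For example, something along these lines should load the filtered file, reusing the same dtypes, separator and decimal settings as before:

import pandas as pd

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
subset = pd.read_csv(r'C:\myfolder\test.csv', sep=";", encoding="utf-8",
                     dtype=dtypes, decimal=",")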