Say we have large csv file (e.g. 200 GB) where only a small fraction of rows (e.g. 0.1% or less) contain data of interest.
Say we define such condition as having one specific column contain a value from a pre-defined list (e.g. 10K values of interest).
Does odo
or Pandas facilitate methods for this type of selective loading of rows into a dataframe?
I don't know of anything in odo
or pandas
that does exactly what you're looking for, in the sense that you just call a function and everything else is done under the hood. However, you can write a short pandas
script that gets the job done.
The basic idea is to iterate over chunks of the csv file that will fit into memory, keeping only the rows of interest, and then combining all the rows of interest at the end.
import pandas as pd
pre_defined_list = ['foo', 'bar', 'baz']
good_data = []
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
chunk = chunk[chunk['column_to_check'].isin(pre_defined_list)]
good_data.append(chunk)
df = pd.concat(good_data)
Add/alter parameters for pd.read_csv
and pd.concat
as necessary for your specific situation.
If performance is an issue, you may be able to speed things up by using an alternative to .isin
, as described in this answer.