Search code examples
python-3.xpandasblazebcolzodo

Pandas / odo / bcolz selective loading of rows from a large CSV file


Say we have large csv file (e.g. 200 GB) where only a small fraction of rows (e.g. 0.1% or less) contain data of interest.

Say we define such condition as having one specific column contain a value from a pre-defined list (e.g. 10K values of interest).

Does odo or Pandas facilitate methods for this type of selective loading of rows into a dataframe?


Solution

  • I don't know of anything in odo or pandas that does exactly what you're looking for, in the sense that you just call a function and everything else is done under the hood. However, you can write a short pandas script that gets the job done.

    The basic idea is to iterate over chunks of the csv file that will fit into memory, keeping only the rows of interest, and then combining all the rows of interest at the end.

    import pandas as pd
    
    pre_defined_list = ['foo', 'bar', 'baz']
    good_data = []
    for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
        chunk = chunk[chunk['column_to_check'].isin(pre_defined_list)]
        good_data.append(chunk)
    
    df = pd.concat(good_data)
    

    Add/alter parameters for pd.read_csv and pd.concat as necessary for your specific situation.

    If performance is an issue, you may be able to speed things up by using an alternative to .isin, as described in this answer.