Tags: python, pandas, csv, chunks

How to process all but one column before concatenating chunks when using chunks to read a large CSV file


I have a large CSV file (7 GB), and I used the following code to read it with Pandas:

import pandas as pd

# chunksize makes read_table return an iterator of DataFrame chunks.
chunks = pd.read_table('input_filename', chunksize=500000)
df = pd.concat((chunk == 1) for chunk in chunks)

This works for me because the file is one-hot encoded, so the chunk == 1 part converts the 0s and 1s into boolean values, which saves a significant amount of memory.
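
As a minimal illustration of that effect (toy data, not the real file), comparing an integer frame against 1 yields bool columns, which store one byte per cell instead of eight for int64:

import pandas as pd

# Toy stand-in for one chunk of one-hot encoded data (hypothetical values).
chunk = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 0, 0]})

print(chunk.dtypes)         # int64 columns: 8 bytes per cell
print((chunk == 1).dtypes)  # bool columns: 1 byte per cell
print(chunk.memory_usage().sum(), (chunk == 1).memory_usage().sum())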

Now I want to use the same method to read another file; the only problem is that the new file has an ID column, which is not one-hot encoded. My question is: how can I keep the ID column intact and convert the rest of the columns in the same way?

I tried some subsetting techniques, including:

df=pd.concat((chunk.loc[:, -1]==1) for chunk in chunks)

but so far none of them has worked.

Thanks!


Solution

  • Try this:

    chunks = pd.read_csv('input_filename', chunksize=500000, index_col='ID')
    # With 'ID' in the index, astype(bool) converts only the data columns;
    # reset_index() then restores 'ID' as an ordinary column.
    df = pd.concat([chunk.astype(bool) for chunk in chunks]).reset_index()
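
  • If 'ID' cannot serve as the index (for example, if it contains duplicates), a variation on the same idea is to convert every column except 'ID' explicitly. This is only a sketch, reusing the file name and chunk size from the question:

    import pandas as pd

    def convert(chunk):
        # Convert every column except 'ID' to boolean, leaving 'ID' intact.
        data_cols = chunk.columns.drop('ID')
        chunk[data_cols] = chunk[data_cols] == 1
        return chunk

    chunks = pd.read_csv('input_filename', chunksize=500000)
    df = pd.concat((convert(chunk) for chunk in chunks), ignore_index=True)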