I have a large CSV file (7 GB) and I used this code to read it in pandas:
chunks = pd.read_table('input_filename', chunksize=500000)
df = pd.concat((chunk == 1) for chunk in chunks)
This works for me because the file is one-hot encoded, so the chunk == 1
part converts the 0s and 1s into boolean values, which saves some memory.
Now I want to use the same method to read another file. The only problem is that the new file has an ID
column, which is not one-hot encoded. My question is: how can I keep the ID
column intact and convert the rest of the columns in the same way?
I tried some subsetting techniques, including:
df = pd.concat((chunk.loc[:, -1] == 1) for chunk in chunks)
but none of them worked so far.
Thanks!
Try this — passing index_col='ID' moves the ID column into the index, so astype(bool) only converts the data columns, and reset_index() then restores ID as a regular column:
chunks = pd.read_csv('input_filename', chunksize=500000, index_col='ID')
df = pd.concat([chunk.astype(bool) for chunk in chunks]).reset_index()
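A minimal runnable sketch of the same idea on a tiny in-memory CSV (the column names and values here are made up for illustration, standing in for the real 7 GB file):

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: an ID column plus one-hot columns.
csv_data = io.StringIO(
    "ID,a,b\n"
    "x1,1,0\n"
    "x2,0,1\n"
)

# index_col='ID' parks the ID column in the index, so astype(bool)
# only touches the one-hot data columns.
chunks = pd.read_csv(csv_data, chunksize=1, index_col='ID')
df = pd.concat([chunk.astype(bool) for chunk in chunks]).reset_index()

print(df.dtypes)
```

After reset_index(), ID is an ordinary column again with its original dtype, while a and b are bool.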