Tags: python, pandas, csv, iterator, refactoring

Refactoring pandas using an iterator via chunksize


I am looking for advice on using a pandas iterator.

I performed a parsing operation using Python pandas, but the size of the input files (output from a bioinformatics program called eggNOG) is causing a RAM bottleneck: the file simply won't process.

The obvious solution is to shift to an iterator, which in pandas means the chunksize option:

import pandas as pd
import numpy as np

df = pd.read_csv('myinfile.csv', sep="\t", chunksize=100)

What's changed from the original code is the chunksize=100 part, which forces read_csv to return an iterator of DataFrame chunks rather than a single DataFrame.

The next step is just to perform a simple operation: drop a few columns, convert all '-' characters to np.nan, then write the whole file.

df = df.drop(['score', 'evalue', 'Description', 'EC', 'PFAMs'], axis=1).replace('-', np.nan)
df.to_csv('my.csv', sep='\t', index=False)

How is this done under a pandas iterator?


Update

The solution is described in the answers below and comprises two components:

  1. Don't load junk at point source: previously I was loading lots of columns I didn't need and deleting them afterwards (not good).
  2. Opening the output file outside the chunking loop. This keeps the output file open while each chunk is written and closes it only after the last chunk (see the sketch just below).
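
Roughly, the two components fit together like this. This is only a stripped-down sketch: the keep-list column names are placeholders and the real parsing in my module is more involved.

import pandas as pd

cols_to_keep = ['col1', 'col2', 'col3']  # placeholder names - component 1: only load what's needed
with open('my.csv', 'w') as out:  # component 2: opened once, outside the chunking loop
    for i, chunk in enumerate(pd.read_csv('myinfile.csv', sep='\t', usecols=cols_to_keep,
                                          na_values='-', chunksize=100)):
        # write each chunk as soon as it's read; header only with the first chunk
        chunk.to_csv(out, sep='\t', index=False, header=(i == 0))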

The outfile contained duplicates. That's inevitable because records get split across different chunks; these were removed, i.e. reduced, via

df = df.groupby(['compoundIndex'])['Frequency'].sum().to_frame()

This made the outfile identical to the one produced by the non-iterator method, and by adjusting chunksize any level of "RAM bottleneck" can be overcome. The actual code is an OO module doing reasonably complex parsing, and the code below fitted straight in.
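
For the record, the reduction can run directly on the combined outfile after the chunked write; a sketch of the idea only (my module structures this a bit differently):

df = pd.read_csv('my.csv', sep='\t')
# the same compoundIndex can land in several chunks, leaving duplicate rows;
# summing the partial frequencies collapses them to one row per compoundIndex
df = df.groupby(['compoundIndex'])['Frequency'].sum().to_frame()
df.to_csv('my.csv', sep='\t')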

Cool.


Solution

  • IIUC, you can do:

    cols_to_drop = ['score', 'evalue', 'Description', 'EC', 'PFAMs']
    data = []
    for chunk in pd.read_csv('myinfile.csv', sep='\t', na_values='-', chunksize=100):
        chunk = chunk.drop(columns=cols_to_drop)
        data.append(chunk)
    pd.concat(data).to_csv('my.csv', sep='\t', index=False)
    

    If you know the columns you want to keep instead of which ones you want to drop, use:

    cols_to_keep = ['col1', 'col2', 'col3']
    data = []
    for chunk in pd.read_csv('myinfile.csv', usecols=cols_to_keep, sep='\t', na_values='-', chunksize=100):
        data.append(chunk)
    pd.concat(data).to_csv('my.csv', sep='\t', index=False)
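
    A note on why the usecols version helps with memory: the parser skips the unwanted columns at read time, so they are never materialised in the chunk at all, instead of being loaded and then dropped.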
    

    Alternative inspired by @el_oso:

    cols_to_drop = ['score', 'evalue', 'Description', 'EC', 'PFAMs']
    with (open('myinfile.csv') as inp,
          open('my.csv', 'w') as out):
        headers = inp.readline().rstrip('\n').split('\t')
        out.write('\t'.join([col for col in headers if col not in cols_to_drop]) + '\n')
        for chunk in pd.read_csv(inp, header=None, names=headers, sep='\t', na_values='-', chunksize=100):
            chunk = chunk.drop(columns=cols_to_drop)
            chunk.to_csv(out, sep='\t', index=False, header=False)
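
    Because the output handle is opened once outside the loop and each chunk is written as soon as it is processed, only one chunk is held in memory at a time; the two pd.concat versions above still assemble the full frame in RAM before writing.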