Tags: python, pandas, pickle, concatenation, sparse-matrix

Speeding up Loading of Pandas Sparse DataFrame


I have a large pickled sparse DataFrame that I generated, but since it was too big to hold in memory, I had to append to the pickle file incrementally as it was generated, as follows:

import pickle

with open('data.pickle', 'ab') as output:
    pickle.dump(df.to_sparse(), output, pickle.HIGHEST_PROTOCOL)

Then in order to read the file back I do the following:

import pandas as pd

df_2 = pd.DataFrame([]).to_sparse()
with open('data.pickle', 'rb') as pickle_file:
    try:
        while True:
            test = pickle.load(pickle_file)
            df_2 = pd.concat([df_2, test], ignore_index=True)
    except EOFError:
        pass

Given the size of this file (20 GB), this method works, but it obviously takes a really long time. Is it possible to parallelize the pickle.load/pd.concat steps for a quicker loading time? Or are there any other suggestions for speeding this process up, specifically the loading part of the code?

Note: the generation step is done on a computer with significantly fewer resources, which is why the loading step, done on a more powerful machine, can hold the DataFrame in memory.

Thanks!


Solution

  • Don't concat in a loop! This is a note in the docs; arguably it should be a warning.

    df_list = []
    with open('data.pickle', 'rb') as pickle_file:
        try:
            while True:
                test = pickle.load(pickle_file)
                df_list.append(test)
        except EOFError:
            pass

    df_2 = pd.concat(df_list, ignore_index=True)
    

    Your current loop makes a copy of the entire frame on EACH iteration, and the frame keeps growing, so the total work is quadratic in the number of chunks; it is not efficient at all.

    The idiom is to append to a list, then do a single concat at the end.

    Furthermore, you are going to be much better off writing to an HDF5 file in the data-generation step, which is faster and supports compression. With compression turned on you can usually get away with writing the full (dense) df, unless it is extremely sparse; see the sketch below.
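
    Below is a minimal sketch of that HDF5 approach, assuming the data is produced in chunks and that PyTables is installed. generate_chunks, data.h5, and the key 'df' are placeholder names for illustration, not anything from the original question.

    import numpy as np
    import pandas as pd

    def generate_chunks(n_chunks=5, rows=1000):
        # Hypothetical stand-in for the incremental generation step.
        for _ in range(n_chunks):
            yield pd.DataFrame(np.random.rand(rows, 4), columns=list('abcd'))

    # On the generation machine: append each chunk to one compressed on-disk table.
    with pd.HDFStore('data.h5', mode='w', complevel=9, complib='blosc') as store:
        for chunk in generate_chunks():
            store.append('df', chunk)

    # On the loading machine: a single read returns the whole frame.
    df_2 = pd.read_hdf('data.h5', 'df')

    store.append writes in the appendable 'table' format, so the chunks accumulate in a single dataset on disk and pd.read_hdf brings the whole frame back in one call, with no repeated copying.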