Tags: python, python-multiprocessing, python-multithreading

Speed up reading multiple pickle files


I have a lot of pickle files. Currently I read them in a loop but it takes a lot of time. I would like to speed it up but don't have any idea how to do that.
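
Roughly, the reading loop looks like this (the directory name and file pattern below are just placeholders):

    import pickle
    from pathlib import Path
    
    # read every pickle file in a directory, one after another, on a single thread
    series_list = []
    for path in Path('data').glob('*.pickle'):
        with open(path, 'rb') as f:
            series_list.append(pickle.load(f))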

Multiprocessing wouldn't work because, in order to transfer data from a child process to the main process, the data needs to be serialized (pickled) and deserialized again.

Using threading wouldn't help either because of the GIL.

I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without the GIL). Is there something like this around?

UPDATE: Answering your questions:

  • The files are partial products of data processing for the purpose of ML
  • They contain pandas.Series objects, but the dtype is not known upfront
  • I want to have many files so that we can easily pick any subset of them
  • I want many smaller files instead of one big file because deserializing one big file takes more memory (at some point both the serialized string and the deserialized objects are held in memory)
  • The size of the files can vary a lot
  • I use Python 3.7, so I believe it's cPickle under the hood anyway
  • Using pickle is very flexible because I don't have to worry about the underlying types - I can save anything

Solution

  • I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without the GIL). Is there something like this around?

    In short: no. pickle is apparently good enough for enough people that no major alternative implementation fully compatible with the pickle protocol has appeared. As of Python 3, cPickle was folded into the standard pickle module (as the C extension _pickle), and it does not release the GIL anyway, which is why threading won't help you (search for Py_BEGIN_ALLOW_THREADS in _pickle.c and you will find nothing).

    If your data can be re-structured into a simpler format such as CSV, or a binary format such as numpy's .npy, there will be less CPU overhead when reading it. Pickle is built for flexibility first rather than for speed or compactness. One possible exception to the rule that more complexity means less speed is the HDF5 format used through h5py, which can be fairly complex, and which I have used to max out the bandwidth of a SATA SSD.
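
    As a rough illustration (not the asker's code; the file names, the dummy Series, and its numeric dtype are assumptions made for the example), the same data could be stored either as plain numpy arrays or as datasets in a single HDF5 file:

    import numpy as np
    import pandas as pd
    import h5py
    
    # a hypothetical Series with a plain numeric dtype (object dtypes
    # would not benefit in the same way)
    s = pd.Series([1.0, 2.0, 3.0], index=[10, 20, 30])
    
    # numpy's .npy format: store the values and the index as separate arrays
    np.save('values.npy', s.to_numpy())
    np.save('index.npy', s.index.to_numpy())
    s_npy = pd.Series(np.load('values.npy'), index=np.load('index.npy'))
    
    # HDF5 via h5py: many datasets can live in one file
    with h5py.File('data.h5', 'w') as f:
        f.create_dataset('values', data=s.to_numpy())
        f.create_dataset('index', data=s.index.to_numpy())
    with h5py.File('data.h5', 'r') as f:
        s_h5 = pd.Series(f['values'][:], index=f['index'][:])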

    Finally, you mention that you have many, many pickle files, and that by itself probably adds no small amount of overhead: each time you open a new file there is some fixed cost imposed by the operating system. Conveniently, you can combine pickle files by simply concatenating them, then call Unpickler.load() repeatedly until you reach the end of the combined file. Here's a quick example of combining two pickle files using shutil:

    import pickle, shutil, os
    
    #some dummy data
    d1 = {'a': 1, 'b': 2, 1: 'a', 2: 'b'}
    d2 = {'c': 3, 'd': 4, 3: 'c', 4: 'd'}
    
    #create two pickles
    with open('test1.pickle', 'wb') as f:
        pickle.Pickler(f).dump(d1)
    with open('test2.pickle', 'wb') as f:
        pickle.Pickler(f).dump(d2)
        
    #combine list of pickle files
    with open('test3.pickle', 'wb') as dst:
        for pickle_file in ['test1.pickle', 'test2.pickle']:
            with open(pickle_file, 'rb') as src:
                shutil.copyfileobj(src, dst)
                
    #unpack the data
    with open('test3.pickle', 'rb') as f:
        p = pickle.Unpickler(f)
        while True:
            try:
                print(p.load())
            except EOFError:
                break
            
    #cleanup
    os.remove('test1.pickle')
    os.remove('test2.pickle')
    os.remove('test3.pickle')
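
    If you control the code that writes these pickles in the first place, you can skip the combining step entirely: call dump() several times on one open file and read the objects back with load() until EOFError. A minimal sketch, with a placeholder file name and dummy data:

    import pickle, os
    
    # dummy stand-ins for the partial products
    objects = [{'a': 1}, {'b': 2}, {'c': 3}]
    
    # write every object into a single pickle stream
    with open('combined.pickle', 'wb') as f:
        for obj in objects:
            pickle.dump(obj, f)
    
    # read them back until the end of the file is reached
    with open('combined.pickle', 'rb') as f:
        while True:
            try:
                print(pickle.load(f))
            except EOFError:
                break
    
    #cleanup
    os.remove('combined.pickle')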