
Reading multiple Python pickled objects at once: buffering and newlines?


To give you context:

I have a large file f, several gigs in size. It contains consecutive pickles of different objects that were generated by running:

for obj in objs: cPickle.dump(obj, f)

I want to take advantage of buffering when reading this file. What I want is to read several pickled objects into a buffer at a time. What is the best way of doing this? I want an analogue of readlines(buffsize) for pickled data. In fact, if the pickled data were newline-delimited, one could simply use readlines(), but I am not sure that is true.

Another option I have in mind is to dumps() each object to a string first and then write the strings to a file, each separated by a newline. To read the file back I can use readlines() and loads(). But I fear that a pickled object may contain the "\n" character, which would throw off this file-reading scheme. Is my fear unfounded?
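
Here is the kind of thing I am worried about (a quick sketch, using the default protocol 0):

    import cPickle

    # Protocol-0 pickles use "\n" internally to terminate opcode arguments,
    # so even a single pickled object spans several "lines".
    data = cPickle.dumps({'key': [1, 2, 3]})
    print repr(data)           # the repr shows embedded '\n' characters
    print data.count('\n')     # greater than zero for this single object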

One option is to pickle everything out as one huge list of objects, but that would require more memory than I can afford. The setup could be sped up by multi-threading, but I do not want to go there before I get the buffering working properly. What's the "best practice" for situations like this?

EDIT: I can also read raw bytes into a buffer and invoke loads() on that, but I need to know how many bytes of the buffer were consumed by loads() so that I can throw the consumed head away.
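
Something like this is what I mean (a sketch; wrapping the buffer in StringIO so that the consumed position can be read back with tell() is just one idea):

    import cPickle
    from cStringIO import StringIO

    def unpickle_buffer(buf):
        # Yield (object, bytes_consumed) for each complete pickle in buf.
        sio = StringIO(buf)
        while sio.tell() < len(buf):
            start = sio.tell()
            # load() advances past exactly one pickle; it raises
            # (e.g. EOFError) if the buffer ends in the middle of one.
            obj = cPickle.Unpickler(sio).load()
            yield obj, sio.tell() - start

Anything left over after the last complete pickle would have to be kept and prepended to the next chunk read from the file.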


Solution

  • file.readlines() returns a list of the entire contents of the file. You'll want to read a few lines at a time. I think this naive code should unpickle your data:

    import pickle

    infile = open('/tmp/pickle', 'rb')
    buf = []
    while True:
        line = infile.readline()
        if not line:                   # end of file
            break
        buf.append(line)
        # A protocol-0 pickle ends with the '.' STOP opcode; this assumes
        # each pickle is followed by a newline, so '.\n' marks a boundary.
        if line.endswith('.\n'):
            print 'Decoding', buf
            print pickle.loads(''.join(buf))
            buf = []
    

    If you have any control over the program that generates the pickles, I'd pick one of:

    1. Use the shelve module.
    2. Print the length (in bytes) of each pickle before writing it to the file, so that you know exactly how many bytes to read in each time (see the sketch just after this list).
    3. Same as above, but write the list of integers to a separate file so that you can use those values as an index into the file holding the pickles.
    4. Pickle a list of K objects at a time. Write the length of that pickle in bytes. Write the pickle. Repeat.
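
    For example, option 2 could look roughly like this (a sketch; I pack the length as a fixed 4-byte integer instead of printing it as text, which keeps the read side simple, and the same idea batches naturally into option 4):

    import struct
    import cPickle

    def write_pickles(objs, outfile):
        for obj in objs:
            data = cPickle.dumps(obj, cPickle.HIGHEST_PROTOCOL)
            # 4-byte little-endian length prefix, then the pickle itself.
            outfile.write(struct.pack('<I', len(data)))
            outfile.write(data)

    def read_pickles(infile):
        while True:
            header = infile.read(4)
            if not header:                          # clean end of file
                break
            (length,) = struct.unpack('<I', header)
            yield cPickle.loads(infile.read(length))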

    By the way, I suspect that the file's built-in buffering should get you 99% of the performance gains you're looking for.
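
    To make that concrete (a sketch; the 1 MB buffer size is arbitrary), you can hand open() a large buffer and just call load() in a loop:

    import cPickle

    # For values > 1, the third argument to open() is (roughly) the buffer
    # size in bytes, so most load() calls are served from memory, not disk.
    infile = open('/tmp/pickle', 'rb', 1024 * 1024)
    while True:
        try:
            obj = cPickle.load(infile)   # reads exactly one pickle
        except EOFError:                 # no more data
            break
        # ... process obj ...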

    If you're convinced that I/O is the bottleneck, have you thought about trying mmap() and letting the OS handle paging in blocks at a time?

    #!/usr/bin/env python

    import mmap
    import cPickle

    fname = '/tmp/pickle'
    infile = open(fname, 'rb')
    # Map the whole file read-only; the OS pages it in as the scan touches it.
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    start = 0
    while True:
        # Assume each pickle ends with the '.' STOP opcode plus a newline.
        end = m.find('.\n', start + 1) + 2
        if end == 1:                    # find() returned -1: no more pickles
            break
        print cPickle.loads(m[start:end])
        start = end