To give you context: I have a large file f, several gigabytes in size. It contains consecutive pickles of different objects that were generated by running
for obj in objs: cPickle.dump(obj, f)
I want to take advantage of buffering when reading this file. What I want is to read several pickled objects into a buffer at a time. What is the best way of doing this? I want an analogue of readlines(buffsize)
for pickled data. In fact, if the pickled data were newline-delimited one could use readlines, but I am not sure whether that is true.
Another option I have in mind is to dumps()
each object to a string first and then write the strings to a file, each separated by a newline. To read the file back I can use readlines()
and loads()
. But I fear that a pickled object may contain the "\n"
character and throw off this file reading scheme. Is my fear unfounded?
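For concreteness, the write side of that scheme might look like the sketch below (objs is sample stand-in data). Two things worth knowing about protocol 0, the default text protocol: any "\n" inside string data is escaped, so the fear is unfounded there, but the format itself uses "\n" to terminate opcodes, so one pickle spans several physical lines and a plain readlines()-then-loads()-per-line reader would still split records. The binary protocols 1 and 2 can contain raw newline bytes, so the delimiter trick is only safe with protocol 0.

import cPickle

objs = [1, 'two\nthree', {'a': [4, 5]}]   # sample stand-in data

out = open('/tmp/pickle', 'wb')
for obj in objs:
    # Protocol 0 escapes '\n' inside string data, so the bare '\n' we
    # append only ever appears right after the '.' STOP opcode.
    out.write(cPickle.dumps(obj, 0) + '\n')
out.close()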
One option is to pickle everything out as one huge list of objects, but that would require more memory than I can afford. The setup could be sped up by multi-threading, but I do not want to go there before I get the buffering working properly. What's the "best practice" for situations like this?
EDIT: I can also read raw bytes into a buffer and invoke loads on that, but I need to know how many bytes of the buffer were consumed by loads so that I can throw the head away.
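For what it's worth, here is one way to get that byte count (my sketch, not from the original post; unpickle_chunk and process are made-up names): wrap the raw buffer in a cStringIO, let cPickle.load() consume exactly one pickle, and check tell() afterwards; whatever trails the last complete pickle is the head you keep for the next read.

import cPickle
from cStringIO import StringIO

def unpickle_chunk(buf):
    # Returns (objects, leftover): leftover is a trailing incomplete
    # pickle, to be prepended to the next chunk read from disk.
    objects = []
    f = StringIO(buf)
    while True:
        start = f.tell()
        try:
            objects.append(cPickle.load(f))   # consumes exactly one pickle
        except (EOFError, ValueError, cPickle.UnpicklingError):
            # Hit a truncated pickle (the exact exception varies).
            return objects, buf[start:]

leftover = ''
infile = open('/tmp/pickle', 'rb')
while True:
    chunk = infile.read(1 << 20)              # ~1 MiB of raw bytes at a time
    if not chunk:
        break
    objects, leftover = unpickle_chunk(leftover + chunk)
    for obj in objects:
        process(obj)                          # hypothetical handler
infile.close()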
file.readlines() returns a list of the entire contents of the file. You'll want to read a few lines at a time. I think this naive code should unpickle your data:
import pickle

infile = open('/tmp/pickle', 'rb')
buf = []
while True:
    line = infile.readline()
    if not line:
        break
    buf.append(line)
    # A protocol-0 pickle ends with the '.' STOP opcode; if a newline
    # was written after each pickle, a line ending '.\n' closes a record.
    if line.endswith('.\n'):
        print 'Decoding', buf
        print pickle.loads(''.join(buf))
        buf = []
If you have any control over the program that generates the pickles, I'd consider the shelve
module. By the way, I suspect that the file
object's built-in buffering should get you 99% of the performance gains you're looking for.
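For instance, the simplest buffered reader (my sketch, not part of the original answer) just calls cPickle.load() on the open file in a loop; each call consumes exactly one pickle from the stream, in the exact format the question's loop wrote, and the file object's buffering takes care of the block I/O:

import cPickle

infile = open('/tmp/pickle', 'rb')
while True:
    try:
        obj = cPickle.load(infile)   # reads exactly one pickle
    except EOFError:
        break                        # clean end of stream
    # ... process obj here ...
infile.close()

And if you do control the writer, a shelve sketch (the path and keys here are made up) gives you keyed, one-at-a-time access instead of a single monolithic stream:

import shelve

db = shelve.open('/tmp/objs')        # hypothetical shelf file
for i, obj in enumerate(objs):       # objs as in the question
    db['obj-%d' % i] = obj           # one entry per object
db.close()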
If you're convinced that I/O is blocking you, have you thought about trying mmap()
and letting the OS handle paging in blocks at a time?
#!/usr/bin/env python
import mmap
import cPickle

fname = '/tmp/pickle'
infile = open(fname, 'rb')
m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
start = 0
while True:
    # Scan for the '.' STOP opcode followed by the newline separator.
    end = m.find('.\n', start + 1) + 2
    if end == 1:
        break   # find() returned -1: no more complete pickles
    print cPickle.loads(m[start:end])
    start = end