
How to split a large generic Python object into chunks


I have a large, arbitrary Python object whose internals I know nothing about. I need to split this object into smaller chunks for storage.

Hope someone can help, Omer.


Solution

  • Pickle it and split the resulting data.

    You can't serialize only "a part" of an object because, in the general case, there is no such thing as "a part of an object". You need knowledge of its internals to split it into parts, which you stated you don't have.

    However, you can use pickle.dump (which writes to a file-like object) and pass it a custom file-like object that splits the resulting data as it receives it.

    E.g. here's a file-like object that writes data to files in chunks of 2 GiB by default (in the example, I set the chunk size to 4 MiB instead):

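    import pickle
    import random


    class SplitFile(object):
        """Write-only file-like object that spreads its data over numbered files."""

        def __init__(self, name_pattern, chunk_size=2*1024**3):
            self.name_pattern = name_pattern  # e.g. "data.part%02d.pickle"
            self.chunk_size = chunk_size      # maximum size of each part, in bytes
            self.file = None                  # part currently being written
            self.part = -1                    # index of the current part
            self.offset = None                # bytes written to the current part

        def write(self, data):
            if not self.file:
                self._split()
            while True:
                size = len(data)
                # Write only as much as still fits into the current part.
                written = min(size, self.chunk_size - self.offset)
                self.file.write(data[:written])
                self.offset += written
                if written == size:
                    break
                # The current part is full: continue in the next one.
                self._split()
                data = data[written:]

        def _split(self):
            # Close the current part (if any) and open the next one.
            if self.file:
                self.file.close()
            self.part += 1
            self.file = open(self.name_pattern % self.part, "wb")
            self.offset = 0

        def close(self):
            if self.file:
                self.file.close()

        def __del__(self):
            self.close()


    big_object = [random.random() for _ in range(1000000)]
    dest = SplitFile("data.part%02d.pickle", 4*1024**2)
    pickle.dump(big_object, dest)
    dest.close()  # flush and close the last part explicitly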
    

    After running the example, we have (the exact number and size of the parts depend on the pickle protocol in use):

    $ ls -l *.pickle
    -rwxrwx---+ 1 Sasha None 4194304 Dec  4 16:02 data.part00.pickle
    -rwxrwx---+ 1 Sasha None 4194304 Dec  4 16:02 data.part01.pickle
    -rwxrwx---+ 1 Sasha None 4194304 Dec  4 16:02 data.part02.pickle
    -rwxrwx---+ 1 Sasha None 4194304 Dec  4 16:02 data.part03.pickle
    -rwxrwx---+ 1 Sasha None 4194304 Dec  4 16:02 data.part04.pickle
    -rwxrwx---+ 1 Sasha None  294912 Dec  4 16:02 data.part05.pickle
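
To load the object again, the parts have to be presented to pickle.load as one continuous stream. The answer above does not show this step, so the following is only a minimal sketch of one way to do it, assuming Python 3 and the same name pattern that was passed to SplitFile. The names JoinedFile, part_paths and load_split_pickle are hypothetical helpers made up for this illustration.

```python
import io
import os
import pickle


class JoinedFile(io.RawIOBase):
    """Read-only raw stream that presents several files as one (hypothetical helper)."""

    def __init__(self, paths):
        super().__init__()
        self._paths = iter(paths)
        self._current = None  # part currently being read

    def readable(self):
        return True

    def readinto(self, buffer):
        if len(buffer) == 0:
            return 0
        while True:
            if self._current is None:
                path = next(self._paths, None)
                if path is None:
                    return 0  # no parts left: end of stream
                self._current = open(path, "rb")
            count = self._current.readinto(buffer)
            if count:
                return count
            # The current part is exhausted: move on to the next one.
            self._current.close()
            self._current = None

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None
        super().close()


def part_paths(name_pattern):
    """Yield name_pattern % 0, name_pattern % 1, ... for as long as the files exist."""
    part = 0
    while os.path.exists(name_pattern % part):
        yield name_pattern % part
        part += 1


def load_split_pickle(name_pattern):
    """Unpickle an object that was written in parts named after name_pattern."""
    # BufferedReader supplies the read() and readline() methods pickle.load needs.
    with io.BufferedReader(JoinedFile(part_paths(name_pattern))) as source:
        return pickle.load(source)
```

With the files from the example, `big_object = load_split_pickle("data.part%02d.pickle")` should return the original list. Counting the parts up from 0 mirrors how SplitFile numbers them, which avoids relying on the sort order of the file names.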