python compound-file

How can I treat a section of a file as though it's a file itself?


I have data stored in either a collection of files or in a single compound file. The compound file is formed by concatenating all the separate files, and then preceding everything with a header that gives the offsets and sizes of the constituent parts. I'd like to have a file-like object that presents a view of the compound file, where the view represents just one of the member files. (That way, I can have functions for reading the data that accept either a real file object or a "view" object, and they needn't worry about how any particular dataset is stored.) What library will do this for me?

The mmap class looked promising since it's constructed from a file, a length, and an offset, which is exactly what I have, but the offset needs to be aligned with the underlying file system's allocation granularity, and the files I'm reading don't meet that requirement. The name of the MultiFile class fits the bill, but it's tailored for attachments in e-mail messages, and my files don't have that structure.

The file operations I'm most interested in are read, seek, and tell. The files I'm reading are binary, so the text-oriented functions like readline and next aren't so crucial. I might eventually also need write, but I'm willing to forgo that feature for now since I'm not sure how appending should behave.


Solution

  • I know you were searching for a library, but as soon as I read this question I thought I'd write my own. So here it is:

    import os
    
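    class View:
        """File-like view of `length` bytes of `f`, starting at `offset`."""
    
        def __init__(self, f, offset, length):
            self.f = f
            self.f_offset = offset  # where the view starts in the underlying file
            self.offset = 0         # current position, relative to the view
            self.length = length
    
        def seek(self, offset, whence=0):
            if whence == os.SEEK_SET:
                new_offset = offset
            elif whence == os.SEEK_CUR:
                new_offset = self.offset + offset
            elif whence == os.SEEK_END:
                new_offset = self.length + offset
            else:
                raise ValueError("invalid whence: %r" % (whence,))
            if new_offset < 0:
                raise ValueError("negative seek position: %r" % (new_offset,))
            self.offset = new_offset
            self.f.seek(self.offset + self.f_offset, os.SEEK_SET)
            # Report the position relative to the view, not the underlying file.
            return self.offset
    
        def tell(self):
            return self.offset
    
        def read(self, size=-1):
            # Reposition first: other views may have moved the shared file object.
            self.seek(self.offset)
            remaining = max(0, self.length - self.offset)
            if size is None or size < 0 or size > remaining:
                size = remaining
            data = self.f.read(size)
            # Advance by what was actually read, which may be less than requested.
            self.offset += len(data)
            return data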
    
    if __name__ == "__main__":
        import collections
    
        # Open in binary mode so that offsets are plain byte positions.
        with open('test.txt', 'rb') as f:
            views = []
            offsets = [i*11 for i in range(10)]
    
            # Each record is ".", a one-digit length, then the payload.
            for o in offsets:
                f.seek(o+1)
                length = int(f.read(1))
                views.append(View(f, o+2, length))
    
            f.seek(0)
    
            # First pass: read every view in one go.
            completes = {}
            for v in views:
                completes[v.f_offset] = v.read()
                v.seek(0)
    
            # Second pass: read the views interleaved, three bytes at a time.
            strs = collections.defaultdict(bytes)
            for i in range(3):
                for v in views:
                    strs[v.f_offset] += v.read(3)
            strs = dict(strs) # We want it to raise KeyErrors after that.
    
            for offset, s in completes.items():
                print(offset, strs[offset], s)
                assert strs[offset] == s, "Something went wrong!"
    

    And I wrote another script to generate the "test.txt" file:

    import random
    import string
    
    # Write ten 11-byte records: ".", a one-digit length, then 9 random letters.
    with open('test.txt', 'w') as f:
        for i in range(10):
            rand_list = list(string.ascii_letters)
            random.shuffle(rand_list)
            rand_str = "".join(rand_list[:9])
            f.write(".%d%s" % (len(rand_str), rand_str))
    

    It worked for me. I tested it on a small text file rather than on large binary files like yours, but I hope it's still useful. If not, then thank you anyway, that was a good challenge :D
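If you are on Python 3 and want the text-oriented methods too, one option is to put the same idea behind `io.RawIOBase`. This is only a sketch under my own assumptions: the class name `FileView` and its parameter names are made up here, it is read-only, and the base file must be opened in binary mode. Implementing `readinto` is enough for the base class to supply `read` and `readall`, and wrapping the view in `io.BufferedReader` adds `readline` and iteration.

```python
import io
import os


class FileView(io.RawIOBase):
    """Read-only view of `length` bytes of `base`, starting at `start`.

    Hypothetical helper for illustration; `base` must be a seekable
    binary file object.
    """

    def __init__(self, base, start, length):
        super().__init__()
        self._base = base
        self._start = start
        self._length = length
        self._pos = 0  # position relative to the start of the view

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=os.SEEK_SET):
        if whence == os.SEEK_SET:
            pos = offset
        elif whence == os.SEEK_CUR:
            pos = self._pos + offset
        elif whence == os.SEEK_END:
            pos = self._length + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if pos < 0:
            raise ValueError("negative seek position: %r" % (pos,))
        self._pos = pos
        return pos

    def tell(self):
        return self._pos

    def readinto(self, buf):
        # Clamp the request to what is left in the view.
        remaining = max(0, self._length - self._pos)
        size = min(len(buf), remaining)
        if size == 0:
            return 0
        # Reposition on every read, since other views may share the base file.
        self._base.seek(self._start + self._pos)
        data = self._base.read(size)
        buf[:len(data)] = data
        self._pos += len(data)
        return len(data)


# Usage: a 13-byte member sitting between a 6-byte header and a trailer.
base = io.BytesIO(b"HEADERfirst\nsecond\nTRAILER")
reader = io.BufferedReader(FileView(base, 6, 13))
print(reader.readline())  # b'first\n'
print(reader.read())      # b'second\n'
```

The view keeps its own position and seeks the base file before each read, so several views can share one underlying file object as long as they are not used from multiple threads at once.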

    Also, I was wondering: if these are actually multiple files, why not store them in some kind of archive file format and use its library to read them?
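To illustrate the archive idea, here is a minimal sketch using the standard `zipfile` module. It assumes you are free to change the storage format, which may not be the case for you. The member names and the `read_prefix` function are invented for the example, and the archive is built in memory only to keep the snippet self-contained. `ZipFile.open` returns a file-like object for a single member, so a reader function that expects a file can be handed a member instead.

```python
import io
import zipfile

# Build a small archive in memory; the member names are made up for the example.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("part-a.bin", b"\x00\x01\x02\x03")
    archive.writestr("part-b.bin", b"hello world")


def read_prefix(fileobj, n):
    """Stand-in for a reader function that only needs a file-like object."""
    return fileobj.read(n)


# Open one member as a file-like object and pass it to the reader function.
with zipfile.ZipFile(buf) as archive:
    with archive.open("part-b.bin") as member:
        prefix = read_prefix(member, 5)
        rest = member.read()

print(prefix, rest)
```

Members opened this way support `read`, and recent Python 3 versions also allow `seek` and `tell` on them, which covers the three operations you listed.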

    Hope it helps.