python, temporary-files

How can I make smaller temporary files?


So I am writing a program which creates a picture of the Mandelbrot set, and I have been incrementally making it better. Right now, each process that is spawned writes some data to a temporary file, which is used later on to put the picture together. However, the temporary files are a lot bigger than the actual picture itself, and I don't have any ideas on how to make them smaller. How do I efficiently write integer data to a file, and get it back? I intend to eventually make this very scalable, so I would need to be able to write arbitrarily long integers for the pixel indices, but the color data is always going to be three integers with a max value of 255. Here is my code:

import multiprocessing

def pixproc(y0, yn, xsteps, ysteps, fname):
    XMIN, YMIN = -2., -1.
    XLEN, YLEN = 3, 2
    with open(fname, 'w') as f:
        for y in xrange(y0, yn):
            print y
            for x in xrange(xsteps):
                c=complex(XMIN + XLEN*(1.*x/xsteps),
                          YMIN + YLEN*(1.*y/ysteps))
                k=c
                for i in xrange(256):
                    k = k*k + c
                    if abs(k)>2: break
                if 0<i<32:
                    #print 'Success!', i
                    print >>f, x, y, 8*i, 0, 0 #This is that part of
                if 32<=i<255:                  #my code that I am trying
                    #print 'Success!', i       #to improve. The rest of 
                    print >>f, x, y, 255, i, i #the code is given for context
    return                                     #and isn't relevant to my question


def main(xsteps, ysteps):
    pool = multiprocessing.Pool()
    n = multiprocessing.cpu_count()
    step = ysteps / n  # ysteps is the image height
    fnames = ["temp" + str(i) for i in xrange(n)]
    for i in xrange(n):
        pool.apply_async(pixproc, 
                         (step*i, 
                          step*(i+1), 
                          xsteps, 
                          ysteps, 
                          fnames[i]))
    pool.close()
    pool.join()
    return fnames


if __name__=="__main__":
    from PIL import Image
    import sys
    width, height = map(int, sys.argv[1:])
    picname = "mandelbrot1.png"
    fnames = main(width, height)
    im = Image.new("RGB", (width, height))
    pp = im.load()
    for name in fnames:
        with open(name) as f:
            for line in f:
                line = map(int, line.rstrip('\n').split(' '))
                pp[line[0], line[1]] = line[2], line[3], line[4]
    im.save(picname)

When I try to make a picture that is 3000x2000, the actual picture is 672 KB, but the temporary files are both close to 30 MB! Can someone suggest a better way to store the data in files? (The important part is in the function pixproc)


Solution

  • Assuming you're just trying to eliminate the overhead of using a text-based format instead of a binary format for your temporary data, and you don't want to rewrite everything to use numpy, there are a few different solutions:


    First, you can keep the data in binary format in the first place: mmap the file, and use ctypes to treat it as a giant record of some kind. This is usually more trouble than it's worth, but it's worth mentioning.

    Assuming your data is nothing but a long list of tuples of 5 bytes:

    import ctypes, mmap

    class Entry(ctypes.Structure):
        _fields_ = [("x", ctypes.c_uint8), ("y", ctypes.c_uint8),
                    ("i", ctypes.c_uint8), ("j", ctypes.c_uint8), ("k", ctypes.c_uint8)]
    Entries = ctypes.POINTER(Entry)
    with open(fname, 'w+b') as f:  # mmap needs the file open for reading and writing
        # size the file for one Entry per pixel in this chunk, then map it
        f.truncate(ctypes.sizeof(Entry) * (yn - y0) * xsteps)
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
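    To actually get at those records, you can overlay a ctypes array on the mapping and fill in its fields directly. A minimal sketch (the entries and n names are just for illustration, and it still assumes every field fits in a byte):

    entries = (Entry * ((yn - y0) * xsteps)).from_buffer(m)
    # then, inside the same x/y loops as pixproc, with n counting pixels:
    entries[n].x, entries[n].y = x, y
    entries[n].i, entries[n].j, entries[n].k = 8*i, 0, 0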
    

    Second, you can use struct. You'll have to read the docs for complete details, but I'll give one example. Let's take this line:

    print >>f, x, y, 8*i, 0, 0
    

    Now, let's assume that all 5 of those are guaranteed to be bytes (0-255). You can just do:

    f.write(struct.pack('BBBBB', x, y, 8*i, 0, 0))
    

    To read them back later:

    x, y, i8, _, _ = struct.unpack('BBBBB', f.read(struct.calcsize('BBBBB')))
    i = i8 // 8
    

    If any of them needs to be longer than a byte, you need to deal with endianness, but that's pretty trivial. For example, if x and y range from -32768 to 32767:

    f.write(struct.pack('>hhBBB', x, y, 8*i, 0, 0))
    

    And make sure to open the file in binary mode.
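    Tying that back to your pixproc loop, the write side might look something like this sketch (keeping your i thresholds, and using '>hhBBB' so x and y can go past 255):

    import struct

    with open(fname, 'wb') as f:  # binary mode
        for y in xrange(y0, yn):
            for x in xrange(xsteps):
                # ... same escape-time loop computing i as before ...
                if 0 < i < 32:
                    f.write(struct.pack('>hhBBB', x, y, 8*i, 0, 0))
                if 32 <= i < 255:
                    f.write(struct.pack('>hhBBB', x, y, 255, i, i))

    On the reading side, keep calling f.read(struct.calcsize('>hhBBB')) until it returns an empty string, and unpack each 7-byte chunk with the same format.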

    And you can of course combine this with mmap if you want, which means you can just use struct.pack_into and struct.unpack_from instead of explicit pack plus write and unpack plus read.
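    For example, a minimal sketch of that combination, assuming the same 5-byte 'BBBBB' records in an mmap m sized as above, with recno standing in for whichever record you're writing or reading:

    recsize = struct.calcsize('BBBBB')
    struct.pack_into('BBBBB', m, recno * recsize, x, y, 8*i, 0, 0)
    # ... and later, to pull record recno back out:
    x, y, i8, _, _ = struct.unpack_from('BBBBB', m, recno * recsize)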


    Next, there's pickle. Either build your list directly and just pickle.dump it, or manually pickle.dumps each entry and add some simple higher-level structure on top of that (or just use shelve, if that higher-level structure is, or could be, a simple mapping from keys to entries). The result may end up larger instead of smaller, and it may be slower, so you'll want to do some testing before committing to this. But sometimes it's a simple solution.
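    A minimal sketch of the list-at-once version, assuming you collect (x, y, r, g, b) tuples into a list inside pixproc instead of writing lines as you go:

    import pickle

    rows = []
    # ... append (x, y, r, g, b) tuples to rows inside the loops ...
    with open(fname, 'wb') as f:
        pickle.dump(rows, f, pickle.HIGHEST_PROTOCOL)

    # and when assembling the picture:
    with open(fname, 'rb') as f:
        rows = pickle.load(f)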


    Finally, you can probably come up with a more compact text format than just printing the str representation of each object. This is usually not worth the effort, but again, it's worth thinking about.