Tags: python, machine-learning, struct, buffer, mnist

unpack_from requires a buffer of at least 784 bytes


I'm running the following function for an ML model.

def get_images(filename):
    bin_file = open(filename, 'rb')
    buf = bin_file.read()  # read the whole file into memory
    bin_file.close()  # release the operating system's file handle
    index = 0
    magic, num_images, num_rows, num_colums = struct.unpack_from(big_endian + four_bytes, buf, index)
    index += struct.calcsize(big_endian + four_bytes)
    images = []  # temp images as tuple
    for x in range(num_images):
        
        im = struct.unpack_from(big_endian + picture_bytes, buf, index)
        index += struct.calcsize(big_endian + picture_bytes)
        im = list(im)
        for i in range(len(im)):
            if im[i] > 1:
                im[i] = 1

However, I am receiving an error at the line:

im = struct.unpack_from(big_endian + picture_bytes, buf, index)

With the error:

error: unpack_from requires a buffer of at least 784 bytes

I have noticed that this error only occurs at certain iterations, and I cannot figure out why. The dataset is the standard MNIST dataset, which is freely available online.

I have also looked through similar questions on SO (e.g. error: unpack_from requires a buffer) but they don't seem to resolve the issue.


Solution

  • You didn't include the struct formats in your minimal reproducible example, so it is hard to say why you are getting the error. Either the file is partial or corrupted, or your struct formats are wrong.

    This answer uses the test file 't10k-images-idx3-ubyte.gz' and file formats found at http://yann.lecun.com/exdb/mnist/

    Open the file and read it into a bytes object (gzip is used because of the file's type).

    import gzip, struct
    with gzip.open(r'my\path\t10k-images-idx3-ubyte.gz', 'rb') as f:
        data = f.read()  # already a bytes object
    print(len(data))
    

    The file format spec says the header is 16 bytes (four 32-bit ints). Separate it from the pixels with a slice, then unpack it.

    hdr,pixels = data[:16],data[16:]
    magic, num_images, num_rows, num_cols = struct.unpack(">4L",hdr)
    # print(len(hdr),len(pixels))
    # print(magic, num_images, num_rows, num_cols)
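
One way to tell a partial file from a wrong format is to compare the buffer length with what the header promises. The sketch below is illustrative only: `check_idx_images` is a hypothetical helper, and the buffer is synthetic data built with `struct.pack`, not a real MNIST file. It assumes the layout described in the spec, a 16-byte header followed by one unsigned byte per pixel.

```python
import struct

def check_idx_images(data):
    """Return (expected_len, actual_len) for an IDX3 image buffer (hypothetical helper)."""
    # Header: four big-endian unsigned 32-bit ints.
    magic, num_images, num_rows, num_cols = struct.unpack_from(">4L", data, 0)
    expected = 16 + num_images * num_rows * num_cols
    return expected, len(data)

# Synthetic stand-in for a truncated file: the header claims 3 images
# of 2x2 pixels (12 bytes), but only 10 pixel bytes follow.
header = struct.pack(">4L", 2051, 3, 2, 2)
truncated = header + bytes(10)

expected, actual = check_idx_images(truncated)
missing = expected - actual  # positive means the buffer is too short
print(expected, actual, missing)
```

If `missing` is positive for your real file, the file is shorter than its header claims, and `unpack_from` will fail on the first image that runs past the end of the buffer.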
    

    There are a number of ways to iterate over the individual images.

    img_size = num_rows * num_cols
    imgfmt = "B"*img_size
    for i in range(num_images):
        start = i * img_size
        end = start + img_size
        img = pixels[start:end]
        img = struct.unpack(imgfmt,img)
        # do work on the img
    

    Or...

    imgfmt = "B"*img_size
    for img in struct.iter_unpack(imgfmt, pixels):
        img = [p if p == 0 else 1 for p in img]
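
For comparison, here is a sketch that keeps the question's `unpack_from`-with-offset structure. The format strings are assumptions, since the question does not show its own: `four_ints` is a hypothetical name, and the per-image format is derived from the header instead of being hard-coded. The input is a small synthetic buffer, not a real MNIST file.

```python
import struct

# Assumed format pieces; the question's own definitions are not shown.
big_endian = ">"
four_ints = "4L"  # hypothetical name for the four 32-bit header fields

def read_images(buf):
    """Unpack every image from an IDX3 buffer and binarize its pixels."""
    magic, num_images, num_rows, num_cols = struct.unpack_from(
        big_endian + four_ints, buf, 0)
    index = struct.calcsize(big_endian + four_ints)  # 16 bytes
    # Build the per-image format from the header instead of hard-coding 784.
    picture_bytes = f"{num_rows * num_cols}B"
    step = struct.calcsize(big_endian + picture_bytes)
    images = []
    for _ in range(num_images):
        im = struct.unpack_from(big_endian + picture_bytes, buf, index)
        index += step
        images.append([1 if p > 0 else 0 for p in im])
    return images

# Synthetic buffer: 2 images of 2x2 pixels.
buf = struct.pack(">4L", 2051, 2, 2, 2) + bytes([0, 5, 0, 255, 1, 0, 0, 9])
images = read_images(buf)
print(images)
```

Deriving the image format from `num_rows * num_cols` means the offsets stay consistent with whatever the header declares.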
    

    The itertools grouper recipe would probably also work.
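
A minimal sketch of the grouper approach, assuming the `zip_longest`-based recipe from the itertools documentation. Iterating over a bytes object yields ints, so no `struct` call is needed for the pixels. The `pixels` and `img_size` values here are small synthetic stand-ins for the names used above.

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks: grouper('ABCDEFG', 3, 'x') -> ABC DEF Gxx."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Synthetic stand-ins: 3 "images" of 4 pixels each.
pixels = bytes([0, 10, 0, 200, 0, 0, 0, 0, 7, 7, 7, 0])
img_size = 4

images = []
for img in grouper(pixels, img_size):
    # Binarize: any non-zero pixel becomes 1.
    images.append([p if p == 0 else 1 for p in img])
print(images)
```

Note that with the default `fillvalue=None`, a truncated final image would be padded with `None` rather than raising an error, so a length check on the buffer is still worth doing first.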