python machine-learning struct buffer mnist

unpack_from requires a buffer of at least 784 bytes

I'm running the following function for an ML model.

def get_images(filename):
    bin_file = open(filename, 'rb')
    buf = bin_file.read()  # read the whole file into memory
    bin_file.close()  # release the file handle
    index = 0
    magic, num_images, num_rows, num_colums = struct.unpack_from(big_endian + four_bytes, buf, index)
    index += struct.calcsize(big_endian + four_bytes)
    images = []  # temp images as tuple
    for x in range(num_images):
        
        im = struct.unpack_from(big_endian + picture_bytes, buf, index)
        index += struct.calcsize(big_endian + picture_bytes)
        im = list(im)
        for i in range(len(im)):
            if im[i] > 1:
                im[i] = 1

However, I am receiving an error at the line:

im = struct.unpack_from(big_endian + picture_bytes, buf, index)

With the error:

error: unpack_from requires a buffer of at least 784 bytes

I have noticed that this error only occurs at certain iterations, and I cannot figure out why. The dataset is the standard MNIST dataset, which is freely available online.

I have also looked through similar questions on SO (e.g. error: unpack_from requires a buffer) but they don't seem to resolve the issue.

Solution

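You didn't include the struct formats in your minimal reproducible example, so it is hard to say exactly why you are getting the error. The message means that fewer than 784 bytes remain in buf from index onward, so either you are using a partial or corrupted file, or your struct formats are wrong.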
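One way to tell the two causes apart is to compare the size the header promises with the size of the buffer you actually read. The following is only a sketch: check_idx_images is a hypothetical helper name, the ">4L" header format is taken from the IDX spec rather than from your code, and the buffer is synthetic stand-in data rather than a real MNIST file. For a real image file the magic number should be 2051; if it is not, the file may still be gzip-compressed or may not be an image file at all.

```python
import struct

HEADER_FMT = ">4L"  # assumed: four big-endian unsigned 32-bit ints, per the IDX image spec


def check_idx_images(buf):
    """Return (expected_size, actual_size) for an IDX image buffer (hypothetical helper)."""
    magic, num_images, num_rows, num_cols = struct.unpack_from(HEADER_FMT, buf, 0)
    header_size = struct.calcsize(HEADER_FMT)  # 16 bytes
    expected = header_size + num_images * num_rows * num_cols
    return expected, len(buf)


# Synthetic stand-in for a real file: the header claims 3 images of 28x28,
# but only 2.5 images' worth of pixel bytes follow (as in a truncated download).
header = struct.pack(HEADER_FMT, 2051, 3, 28, 28)
truncated = header + bytes(28 * 28 * 2 + 392)

expected, actual = check_idx_images(truncated)
print(expected, actual)  # 2368 1976
```

If the two numbers differ, the file is incomplete (or was not decompressed), and unpack_from will fail on the first image that runs past the end of the buffer. If they match, the struct formats are the more likely culprit.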

This answer uses the test file 't10k-images-idx3-ubyte.gz' and the file format described at http://yann.lecun.com/exdb/mnist/

Open the file and read it into a bytes object (gzip is used because the file is gzip-compressed).

import gzip
import struct

with gzip.open(r'my\path\t10k-images-idx3-ubyte.gz', 'rb') as f:
    data = f.read()  # decompressed contents as a bytes object
print(len(data))

The file format spec says the header is 16 bytes (four 32-bit integers). Separate it from the pixels with a slice, then unpack it.

hdr, pixels = data[:16], data[16:]
magic, num_images, num_rows, num_cols = struct.unpack(">4L", hdr)
# print(len(hdr),len(pixels))
# print(magic, num_images, num_rows, num_cols)

There are a number of ways to iterate over the individual images.

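img_size = num_rows * num_cols
imgfmt = f"{img_size}B"  # one unsigned byte per pixel, same as "B" * img_size
for i in range(num_images):
    start = i * img_size
    end = start + img_size
    img = pixels[start:end]  # bytes for image i
    img = struct.unpack(imgfmt, img)  # tuple of ints in 0..255
    # do work on the img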

Or...

imgfmt = "B" * img_size
for img in struct.iter_unpack(imgfmt, pixels):
    # binarize: any non-zero pixel becomes 1
    img = [p if p == 0 else 1 for p in img]

The itertools grouper recipe would probably also work.
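For what it's worth, here is a minimal sketch of that idea. It uses the zip-based form of the grouper recipe from the itertools documentation, and the pixels value is a tiny synthetic stand-in (two 2x2 "images"), not real MNIST data. Iterating over a bytes object in Python 3 already yields ints, so struct is not needed for the pixel data in this variant.

```python
def grouper(iterable, n):
    """Yield consecutive n-tuples from iterable; a short trailing chunk is dropped."""
    args = [iter(iterable)] * n  # n references to the same iterator
    return zip(*args)


# Synthetic stand-in for the pixel data: two 2x2 images.
pixels = bytes([0, 5, 255, 0, 1, 0, 0, 9])
img_size = 4

# Binarize each image: any non-zero pixel becomes 1.
images = [[0 if p == 0 else 1 for p in img] for img in grouper(pixels, img_size)]
print(images)  # [[0, 1, 1, 0], [1, 0, 0, 1]]
```

Note that zip stops at the last complete chunk, so with a truncated file this version would silently drop the partial final image instead of raising an error.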