Reading MNIST by byte with Python


I am playing around with the MNIST dataset and I have encountered the following, which I don't quite understand. According to the documentation the data is formatted as follows:

[offset] [type]          [value]          [description] 
0000     32 bit integer  0x00000801(2049) magic number (MSB first) 
0004     32 bit integer  60000            number of items 
0008     unsigned byte   ??               label 
0009     unsigned byte   ??               label 
........ 
xxxx     unsigned byte   ??               label
The labels values are 0 to 9.

Thus, I would expect bytes 4-7, corresponding to the number of items (60,000), to be:

struct.pack('i', 60000)
>> '`\xea\x00\x00'

However, when I read the file byte-by-byte, it looks like they are in reverse order:

import gzip
import struct

with gzip.open(path_to_file, 'rb') as f:
    print struct.unpack('cccc', f.read(4))
    for i in range(4):
        print struct.unpack('c', f.read(1))
>> ('\x00', '\x00', '\x08', '\x01')
>> ('\x00', '\x00', '\xea', '`')

Clearly, I can reverse them to get the expected order, but I am confused as to why the bytes seem to be reversed.


Solution

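  • This is an artifact of byte ordering within a word. The field is a 32-bit integer, so you're supposed to read it as one rather than byte by byte. The file stores its integers big-endian: note that the format of the first field is specified as "MSB first". Your struct.pack('i', 60000) call uses your machine's native byte order, which on most PCs is "little-endian" addressing, the lowest (earliest) address having the least significant byte. That is why the packed bytes and the bytes in the file appear reversed relative to each other.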
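
To make the byte-order point concrete, here is a minimal sketch in Python 3 (the question's snippets are Python 2). It does not open the real file: the `header` bytes are simply the eight bytes printed in the question, and the variable names are illustrative. The `>` prefix in a `struct` format string selects big-endian, `<` selects little-endian, and `I` is an unsigned 32-bit integer.

```python
import struct

# The first 8 bytes of the label file, as printed in the question:
# the magic number followed by the number of items.
header = b"\x00\x00\x08\x01\x00\x00\xea\x60"

# Read both fields as big-endian (MSB first) unsigned 32-bit integers.
magic, num_items = struct.unpack(">II", header)
print(magic, num_items)  # 2049 60000

# Packing the same value with an explicit byte order shows both layouts.
big = struct.pack(">I", 60000)     # b'\x00\x00\xea`'  (what the file contains)
little = struct.pack("<I", 60000)  # b'`\xea\x00\x00'  (what 'i' gives on a little-endian machine)
print(big == little[::-1])  # True
```

With the real file, the same format string should work on the first eight bytes returned by `f.read(8)`, so there is no need to reverse anything by hand.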