python numpy mnist idx

Convert MNIST data from numpy arrays to original ubyte data


I used this code almost exactly, changing only these lines:

f = gzip.open("../data/mnist.pkl.gz", 'rb')
training_data, validation_data, test_data = cPickle.load(f)

to these lines:

import gzip
import pickle as cPickle

# The file was pickled under Python 2, so decode its byte strings as latin1.
f = gzip.open("mnist.pkl.gz", 'rb')
u = cPickle._Unpickler(f)
u.encoding = 'latin1'
training_data, validation_data, test_data = u.load()

to account for pickling issues. The original mnist.pkl.gz was downloaded from his repo (available here), or the code to generate the .pkl.gz is here. The output is great: it is a pickled numpy array of the training and test data, and when I print the length of the training data, I can see that it contains 250,000 numpy arrays.

I need to get the data back into the exact format of the original MNIST data (i.e. ubyte, with training and testing data and labels in separate files) so that it can be fed into an external pipeline that I have no control over, so it must match the original.

I'm really stuck on how to do this. I can see things like this that might help, but I can't see how they apply to this problem. If someone could help me convert these pickled numpy arrays back to the original MNIST format (i.e. ubyte, with training and testing data and labels in separate files), I'd really appreciate it.

Edit 1: I've just realised something that might make this easier: I only need to convert the training data into ubyte format, not the testing data, since I already have the testing data in ubyte format from the original.


Solution

  • Once you have the data in numpy arrays, you can convert them into the MNIST format; for reference, see https://github.com/davidflanagan/notMNIST-to-MNIST/blob/17823f4d4a3acd8317c07866702d2eb2ac79c7a0/convert_to_mnist_format.py#L92

    You can read more about the MNIST data format here: http://yann.lecun.com/exdb/mnist/

    You can also verify your converted images using this answer: https://stackoverflow.com/a/53181925
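
As a rough illustration of the conversion described above, here is a minimal sketch that writes numpy arrays out as IDX (ubyte) files and reads them back for checking. It follows the layout on the MNIST page: a big-endian header (magic number 2051 plus count, rows and columns for images; magic number 2049 plus count for labels) followed by the raw unsigned bytes. The function names, the 28x28 default, and especially the `scale` factor are assumptions for illustration and not part of the linked script. The sketch assumes the pickled pixels are floats that were produced by dividing the original bytes by 256; check the maximum value of your data and pass `scale=255.0` if it is exactly 1.0.

```python
import gzip
import struct

import numpy as np


def _opener(path):
    # Write/read gzip transparently when the file name ends in .gz.
    return gzip.open if path.endswith(".gz") else open


def to_uint8_images(images, scale=256.0, rows=28, cols=28):
    """Turn float images (flattened or 2-D) into uint8 of shape (N, rows, cols).

    `scale` is an assumption about how the floats were normalised.
    """
    arr = np.asarray(images, dtype=np.float64).reshape(-1, rows, cols)
    return np.clip(np.rint(arr * scale), 0, 255).astype(np.uint8)


def write_idx_images(path, images_u8):
    """Write a uint8 array of shape (N, rows, cols) as an idx3-ubyte file."""
    n, rows, cols = images_u8.shape
    with _opener(path)(path, "wb") as f:
        # Header: magic 2051, image count, rows, cols (big-endian uint32).
        f.write(struct.pack(">IIII", 2051, n, rows, cols))
        f.write(np.ascontiguousarray(images_u8).tobytes())


def write_idx_labels(path, labels):
    """Write integer labels (0-255) as an idx1-ubyte file."""
    labels_u8 = np.asarray(labels).astype(np.uint8).ravel()
    with _opener(path)(path, "wb") as f:
        # Header: magic 2049, label count (big-endian uint32).
        f.write(struct.pack(">II", 2049, labels_u8.size))
        f.write(labels_u8.tobytes())


def read_idx(path):
    """Read an unsigned-byte IDX file back into a numpy array."""
    with _opener(path)(path, "rb") as f:
        # Two zero bytes, a type code (0x08 = unsigned byte), then the rank.
        _zero, type_code, ndim = struct.unpack(">HBB", f.read(4))
        if type_code != 0x08:
            raise ValueError("only unsigned-byte IDX files are handled here")
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)
```

Assuming `training_data` unpacks into a pair of images and labels, as in the question, usage would look something like `images, labels = training_data`, then `write_idx_images("train-images-idx3-ubyte", to_uint8_images(images))` and `write_idx_labels("train-labels-idx1-ubyte", labels)`. Reading the files back with `read_idx` and comparing shapes and a few images against the arrays you started from is a quick sanity check before handing the files to the external pipeline.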