machine-learning, dataset, computer-vision, mnist

Why does the InfiMNIST (MNIST) 8M-example set end at index 8 109 999?


On the website http://leon.bottou.org/projects/infimnist, it says:

    Generating files containing the MNIST8M training set:

    $ infimnist lab 10000 8109999 > mnist8m-labels-idx1-ubyte
    $ infimnist pat 10000 8109999 > mnist8m-patterns-idx3-ubyte

However, I fail to see why the range runs from 10 000 to 8 109 999. Even if I compute 8 109 999 - 10 000, it still doesn't make sense to me.

To me, 8M would end at 8 000 000 + 9 999 = 8 009 999: the first block ends at index 9 999, so the range from 10 000 to 8 009 999 would hold exactly 8 million images.

Does anyone understand why the end index is 8 109 999?


Solution

  • According to a fellow Kaggle user, this is why:

    The 8M dataset is the original images + 134 distortions/original. So there are

    135*60,000 = 8,100,000

    training images.

    Adding the 10,000 test images you get 8,110,000 images.

    The test images are from index 0 to 10,000-1=9,999 and the training images are from index 10,000 to 8,110,000-1 = 8,109,999.

    I hope this helps.

    The original dataset is also here:

    https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html

    You can see that it lists "# of data: 8,100,000", which matches the training count above.
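
The arithmetic in the answer can be checked with a short Python sketch. The constant and function names are illustrative and not part of infimnist; the counts (10,000 test images, 60,000 training images, 134 distortions per original) are taken from the answer above.

```python
# Counts as stated in the answer above (not read from infimnist itself).
NUM_TEST = 10_000            # test images occupy indices 0..9,999
NUM_TRAIN = 60_000           # original MNIST training images
DISTORTIONS_PER_IMAGE = 134  # distorted copies generated per original


def mnist8m_index_range():
    """Return the (first, last) inclusive indices of the MNIST8M training set."""
    # Each original contributes itself plus its distortions: 135 * 60,000.
    total_train = (1 + DISTORTIONS_PER_IMAGE) * NUM_TRAIN
    first = NUM_TEST                    # training indices start right after the test set
    last = NUM_TEST + total_train - 1   # inclusive end index
    return first, last


first, last = mnist8m_index_range()
print(first, last, last - first + 1)  # prints: 10000 8109999 8100000
```

Because both bounds are inclusive, the range 10000 to 8109999 holds 8,100,000 images, so "8M" is a rounded name for the 8.1 million training examples.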