
Re-create MNIST Dataset in Pytorch

I am a newbie in PyTorch and, despite quite a bit of searching, I am unable to grasp some concepts about datasets. Say I retrieve the MNIST dataset as follows:

import torch
import torchvision
data =

It took me a while to understand that a DataLoader object is an iterable, so I can check the shape of one training batch with



torch.Size([128, 1, 28, 28])

So I gather 128 is the number of training rows (as per the batch variable), 28*28 are the pixels, and the second dimension is the label.

I also saw that the dataset is organised in such way that one could iterate like

for (x,y) in data:
    pass  # do something

But, FIRST QUESTION: I cannot figure out where the (x,y) tuple (training data and labels) is defined, given that


returns what seems to be a single tensor of shape (128,1,28,28). How does the DataLoader know that the second dimension in that tensor is a label? And what if it were multi-dimensional?

Now to the main difficulty. Say, for learning purposes, I would like to recreate the same dataset from scratch, from a numpy array. I downloaded a .csv file with 59999 rows and 785 columns, the first containing the labels (column name "5"), the rest the pixel values. I am not interested in labels for now, just the pixel values (the dataset is to be fed to an autoencoder such as this one).

I tried this

import pandas as pd
import numpy as np
data = pd.read_csv("mnist_train.csv")
labels = data["5"].values
datapoints = data.iloc[:,1:]

And then I tried

from torch.utils.data import TensorDataset, DataLoader

batch_size = 128
dataset_pytor = TensorDataset(torch.from_numpy(datapoints.values.reshape(-1,28,28)).unsqueeze(1))
my_loader = DataLoader(dataset_pytor, shuffle=True, batch_size=batch_size)

But it does not work: the code I run later fails, and the failure boils down to this error

for x in my_loader:
    x =
AttributeError: 'list' object has no attribute 'to'

I cannot understand what is going on. If I run

for x in my_loader:
    print(type(x))

I get type lists.

If I run the same for data, that is, the DataLoader defined above using the built-in MNIST dataset

for x in my_data:
    print(type(x))

I also get lists. Only if I do

for x,y in my_data:
    print(type(x))

then I get

<class 'torch.Tensor'>

Why is this? My question is then, how to recreate the MNIST dataset from a numpy array?


  • To answer all your questions: (128, 1, 28, 28) is the shape of your tensor, where 128 is the batch size, 1 is the number of channels (an RGB image would have 3 here), and the two 28s are, as you rightly put it, the height and width of the image. That answers your first question: the DataLoader does not treat the second dimension as the label. The labels arrive as a separate tensor, the second element of each batch, which is why iterating with (x,y) works. As to your second question, after you convert the csv file into a TensorDataset object, iterating through it gives you tuples, even when the dataset holds a single tensor. The first element of each tuple is your tensor.

      for x in dataset_pytor:

    Try the above code with your dataset_pytor object
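
To make the tuple behaviour concrete, here is a minimal sketch that uses random numbers in place of the CSV contents. The names pixels, targets, labelled_loader, pixel_dataset, pixel_loader and device are illustrative assumptions, not taken from the question. It suggests that the (x, y) pair comes from the dataset itself, and that a TensorDataset built from a single tensor still yields one-element batches, which would explain the 'list' object has no attribute 'to' error.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical stand-ins for the CSV contents: 300 rows of 784 pixels plus labels.
pixels = np.random.randint(0, 256, size=(300, 784))
targets = np.random.randint(0, 10, size=300)

# Scale to [0, 1] floats and reshape to (N, channels, height, width).
images = torch.from_numpy(pixels).float().div(255).reshape(-1, 1, 28, 28)
labels = torch.from_numpy(targets)

# Two tensors in, so each sample is an (image, label) tuple and each batch a pair.
labelled_loader = DataLoader(TensorDataset(images, labels), batch_size=128)
x, y = next(iter(labelled_loader))
print(x.shape, y.shape)  # torch.Size([128, 1, 28, 28]) torch.Size([128])

# One tensor in, so each sample is a 1-tuple and each batch has one element.
pixel_dataset = TensorDataset(images)
print(len(pixel_dataset[0]))  # 1

pixel_loader = DataLoader(pixel_dataset, batch_size=128, shuffle=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # assumed device

for (batch,) in pixel_loader:   # unpack the one-element batch
    batch = batch.to(device)    # a Tensor, so .to() is available
    break
```

Writing the loop as for (batch,) in pixel_loader (or indexing with x[0]) is probably the smallest change to the original code, since it keeps the TensorDataset as it is.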

    An easy way to recreate MNIST is to create your own dataset object:

    I am assuming my first column in the dataframe contains my labels

      import numpy as np
      import torch
      from torch.utils.data import Dataset

      class MNIST(Dataset):
          def __init__(self, dataframe, transform=None):
              self.transform = transform                   # optional callable applied to each image
              self.label = dataframe.iloc[:, 0].values     # first column: labels
              self.sample = dataframe.iloc[:, 1:].values   # remaining 784 columns: pixels

          def __len__(self):
              return len(self.label)

          def __getitem__(self, index):
              # One row of 784 pixels becomes a (1, 28, 28) float image
              img_tensor = torch.from_numpy(self.sample[index].reshape(1, 28, 28)).float()
              if self.transform:
                  img_tensor = self.transform(img_tensor)
              label_tensor = torch.tensor(self.label[index])
              return img_tensor, label_tensor

    Pass your dataframe to the class above to create the dataset:

      from torch.utils.data import DataLoader

      trainset = MNIST(data)  # data is your dataframe
      my_loader = DataLoader(trainset, shuffle=True, batch_size=128)
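
Since the question asks about starting from a numpy array, the same idea can be sketched without going through a DataFrame on every lookup. The class name ArrayMNIST, the synthetic array fake and the loader name array_loader are hypothetical, and the layout (label in column 0, 784 pixel columns after it) is assumed from the CSV described in the question.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ArrayMNIST(Dataset):
    """Hypothetical dataset wrapping an (N, 785) array: label first, then 784 pixels."""

    def __init__(self, array):
        array = np.asarray(array)
        # Convert once up front instead of slicing per sample.
        self.labels = torch.from_numpy(array[:, 0].astype(np.int64))
        pixels = array[:, 1:].astype(np.float32) / 255.0
        self.images = torch.from_numpy(pixels).reshape(-1, 1, 28, 28)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        # Returning a tuple here is what lets the loop unpack (images, labels).
        return self.images[index], self.labels[index]


# Synthetic data in the CSV layout (assumption: random values, not real digits).
rng = np.random.default_rng(0)
fake = np.concatenate(
    [rng.integers(0, 10, size=(200, 1)), rng.integers(0, 256, size=(200, 784))],
    axis=1,
)

array_loader = DataLoader(ArrayMNIST(fake), batch_size=64, shuffle=True)
for images, labels in array_loader:
    print(images.shape, labels.shape)
    break
```

With the real file, something like ArrayMNIST(data.values) should work, where data is the DataFrame from the question. If the CSV has no header row, as the column name "5" suggests, reading it with pd.read_csv("mnist_train.csv", header=None) keeps the first digit as data instead of turning it into column names.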