I am fairly new to Pytorch (and have never done advanced coding). I am trying to learn the basics of deep learning using the d2l.ai textbook but am having trouble with understanding the logic behind the code for dataloaders. I read the torch.utils.data docs and am not sure what the DataLoader class is meant for, and when for example I am supposed to use the torch.utils.data.TensorDataset class in combination with it. For example, d2l defines a function:
def load_array(data_arrays, batch_size, is_train=True):
    """Construct a PyTorch data iterator."""
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)
I assume this is supposed to return an iterable that iterates over different batches. However, I don't understand what the data.TensorDataset part does (it seems like there are a lot of options listed on the docs page). Also, the documentation says that there are two types of datasets: iterable-style and map-style. When describing the former type, it says:
"This type of datasets is particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data."
What does it mean for "a random read to be expensive or improbable" and for the batch_size to depend on the fetched data? Can anyone give an example of this?
If there is any source where a CompSci noob like me can learn these basics, I'd really appreciate tips!
Thanks very much!
I'll give you an example of how to use dataloaders and will explain the steps:
Dataloaders are iterables over the dataset. So when you iterate over one, it returns a batch of B samples randomly collected from the dataset (each sample consisting of the data and its target/label), where B is the batch size.
To create such a dataloader you first need a class which inherits from PyTorch's Dataset class. PyTorch already ships some implementations of it, for example TensorDataset, which simply wraps tensors that are already in memory (I show a tiny TensorDataset example a bit further down). But the standard way is to write your own. Here is an example for image classification:
import os

import numpy as np
import torch
from PIL import Image


class YourImageDataset(torch.utils.data.Dataset):
    def __init__(self, image_folder):
        self.image_folder = image_folder
        self.images = os.listdir(image_folder)

    # get one sample (image tensor plus its label)
    def __getitem__(self, idx):
        image_file = self.images[idx]
        image = Image.open(os.path.join(self.image_folder, image_file))
        image = np.array(image)
        # normalize image to [0, 1]
        image = image / 255
        # convert to tensor and move the channel axis to the front (H x W x C -> C x H x W)
        image = torch.Tensor(image).permute(2, 0, 1)
        # get the label; in this case the label was noted in the name of the image file,
        # e.g. 1_image_28457.png where 1 is the label and the number at the end is just an id
        target = int(image_file.split("_")[0])
        target = torch.tensor(target)
        return image, target

    def __len__(self):
        return len(self.images)
To get an example image you can index the dataset with some index (which calls the __getitem__ method). It will then return the tensor of the image and the tensor of the label at that index. For example:
dataset = YourImageDataset("/path/to/image/folder")
image, target = dataset[0]  # get the sample at index 0 (same as dataset.__getitem__(0))
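Since you also asked what data.TensorDataset does: it is the simplest ready-made map-style dataset. You give it one or more tensors of equal length and, when you index it with i, it returns the i-th row of each of them as a tuple. That is exactly what load_array in d2l relies on. A minimal sketch (the tensor shapes here are made up just for illustration):

import torch
from torch.utils import data

# 1000 samples with 2 features each, plus one target per sample (made-up shapes)
features = torch.randn(1000, 2)
labels = torch.randn(1000, 1)

# TensorDataset just pairs up the rows of the tensors you pass in
dataset = data.TensorDataset(features, labels)
print(dataset[0])          # -> (features[0], labels[0])

# the DataLoader then samples batches from it, exactly like load_array does
loader = data.DataLoader(dataset, batch_size=32, shuffle=True)
X, y = next(iter(loader))  # X has shape (32, 2), y has shape (32, 1)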
Alright, so now you have created the class which preprocesses and returns ONE sample and its label. Now we have to create the dataloader, which "wraps" around this class and can then return whole batches of samples from your dataset class. Let's create three dataloaders: one which iterates over the train set, one for the test set and one for the validation set:
dataset = YourImageDataset("/path/to/image/folder")

# let's split the dataset into three parts (train 70%, test 15%, validation 15%)
test_size = 0.15
val_size = 0.15
test_amount, val_amount = int(len(dataset) * test_size), int(len(dataset) * val_size)

# this function randomly splits the dataset for you, but you could also implement the split yourself
train_set, val_set, test_set = torch.utils.data.random_split(dataset, [
    len(dataset) - (test_amount + val_amount),
    test_amount,
    val_amount,
])
# B is your batch size, e.g. 128
train_dataloader = torch.utils.data.DataLoader(
    train_set,
    batch_size=B,
    shuffle=True,
)
val_dataloader = torch.utils.data.DataLoader(
    val_set,
    batch_size=B,
    shuffle=True,
)
test_dataloader = torch.utils.data.DataLoader(
    test_set,
    batch_size=B,
    shuffle=True,
)
Now you have created your dataloaders and are ready to train! For example like this:
for epoch in range(epochs):
    for images, targets in train_dataloader:
        # 'images' is a batch of B samples (shape: B x 3 x image_height x image_width)
        # 'targets' holds the B labels of those images (same order)
        optimizer.zero_grad()
        images, targets = images.cuda(), targets.cuda()
        predictions = model.train()(images)  # .train() sets training mode and returns the model
        . . .
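For completeness, here is how the elided part of that loop typically continues. This is only a sketch: the loss function (nn.CrossEntropyLoss), model, optimizer and epochs are assumed to exist and are just placeholders here:

import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # assumed loss for a classification task

model.train()  # put the model into training mode once, before the loop
for epoch in range(epochs):
    for images, targets in train_dataloader:
        images, targets = images.cuda(), targets.cuda()
        optimizer.zero_grad()
        predictions = model(images)
        loss = criterion(predictions, targets)
        loss.backward()   # compute the gradients
        optimizer.step()  # update the model weights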
Normally you would put the "YourImageDataset" class in its own file and then import it into the file in which you want to create the dataloaders. I hope I could make clear what the role of the DataLoader and the Dataset class is and how to use them!
I don't know much about iterable-style datasets, but from what I understood: the method I showed you above is the map-style. You use that if your dataset is stored in a .csv, .json or whatever kind of file (or folder), so that you can look up any row or entry of the dataset by its index. The iterable-style takes your dataset (or a part of it) and converts it into an iterable that you consume one item after another. For example, if your dataset were a list, this is what an iterator over the list would look like:
dataset = [1, 2, 3, 4]
dataset = iter(dataset)
print(next(dataset))
print(next(dataset))
print(next(dataset))
print(next(dataset))
# output:
# >>> 1
# >>> 2
# >>> 3
# >>> 4
So each call to next() gives you the next item of the list. Using this together with a PyTorch DataLoader can be more efficient and faster in some situations. Normally the map-style dataloader is fast enough and is the common choice, but the documentation suggests that when you are loading data batches from a database or a stream (where jumping to an arbitrary record, i.e. a "random read", can be slow or outright impossible) the iterable-style dataset is more efficient.
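To make the iterable-style case a bit more concrete, here is a small sketch of a torch.utils.data.IterableDataset that streams numbers from a text file. The class name and the file data.txt are made up; the point is that you can only read the file front to back, so a "random read" (jumping straight to sample number i) would be expensive, which is exactly the situation the docs describe:

import torch
from torch.utils.data import DataLoader, IterableDataset


class LineStreamDataset(IterableDataset):
    """Streams samples from a (possibly huge) text file one line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # no __getitem__ here: we cannot jump to line i, we can only read forward
        with open(self.path) as f:
            for line in f:
                yield torch.tensor(float(line.strip()))


# the DataLoader just keeps pulling the next items from the stream to form batches
loader = DataLoader(LineStreamDataset("data.txt"), batch_size=4)
for batch in loader:
    print(batch)  # tensors of 4 values each (the last batch may be smaller)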
This explanation of the iterable-style is a bit vague, but I hope it gets across what I understood. I would recommend sticking with the map-style first, as I explained in my original answer.