Tags: deep-learning, pytorch, computer-vision, google-colaboratory

Most efficient way to use a large image dataset with Google Colab -- getting drive timeout + memory errors


I'm using Google Colab to train a classifier in PyTorch, and my training dataset has ~30,000 jpg images which I have stored in my Google Drive. Efficiently using this large amount of data with Colab and Drive has been a nightmare, primarily because my Google Drive tends to crash or "time out" when I try to read images from a folder.

These are the two approaches I've tried so far, and both have failed.

  1. Read images directly from Google Drive inside __getitem__, i.e. my torch Dataset object looks something like:
import PIL.Image
import torch
from torchvision import transforms

class Dataset(torch.utils.data.Dataset):
    def __init__(self, image_ids, labels):
        self.image_ids = image_ids
        self.labels = labels

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, i):
        # Read the image from the mounted Drive folder on demand
        img_path = f'drive/MyDrive/images/{self.image_ids[i]}'
        img = transforms.ToTensor()(PIL.Image.open(img_path))
        label = self.labels[i]
        return img, label

Thus, when __getitem__ is called, it reads the image from the images folder in my Google Drive (where all 30,000 images are stored). The problem is that when I create a DataLoader and loop over the minibatches, I get a "Google Drive timeout" error, which from what I've read is something that can happen with large folders in Google Drive.

  2. Create a TensorDataset: To circumvent the above issue, I thought I would create a TensorDataset. To do this, I first have to build one massive tensor of all 30,000 training images, i.e. of shape (30000, 3, 128, 128) (each image is 3x128x128), which takes a while, so I can't rebuild it every time I run my code. I therefore tried saving this large tensor, but that leads to memory issues in Colab and crashes the runtime. Besides, it's about 12 GB, so I'm sure this is not an efficient way. (Roughly what this approach looks like is sketched below.)
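For reference, here is a minimal sketch of that second approach, assuming image_ids and labels are the same lists used in the Dataset above:

import PIL.Image
import torch
from torch.utils.data import TensorDataset
from torchvision import transforms

to_tensor = transforms.ToTensor()

# Stack every image into one huge float tensor of shape (30000, 3, 128, 128);
# building (and saving) this tensor is what exhausts the Colab RAM.
all_images = torch.stack([
    to_tensor(PIL.Image.open(f'drive/MyDrive/images/{img_id}'))
    for img_id in image_ids
])
all_labels = torch.tensor(labels)

dataset = TensorDataset(all_images, all_labels)

# Saving it so it doesn't have to be rebuilt each session is where the
# memory / size problems described above show up.
torch.save(all_images, 'drive/MyDrive/all_images.pt')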

Does anyone have suggestions on how to do this? The setup is very simple, but it's proving to be a bit annoying because Google Drive doesn't seem primed to do these things. I just have a folder of 30,000 images which I want to read as Torch tensors (in minibatches for training) into Colab. What's the best way of doing this, and/or how can I solve the problems in the approaches I discussed above?

My sense is that approach #1 is the most sensible, because it only reads into memory what is needed at a time. But for some reason, reading files from a Google Drive folder with many elements (this one has 30,000) leads to something called a "Google Drive timeout". The same process is trivial on my local machine, but I need GPUs for training, so I need to be able to do this on Colab. I don't know how to solve this.

For the record, I use Colab Pro so I have access to high-RAM runtimes.


Solution

  • I have 2 suggestions:

    1. Subfolder Strategy: Simply divide the data folder into subfolders, following some naming convention, and adapt your Dataset to that convention (a sketch is given after these notes). You can see this relevant link: google suggestion

    2. GCP - Object Storage Strategy: You can use a Google Cloud Storage bucket without changing your data format. Upload your data to a GCS bucket, give your Colab environment authorization, and use the GCP SDK to access the data (a sketch is given after these notes). I suggest you use a bucket because object storage is designed for data with a large number of files. This strategy adds some overhead, but it should not be that slow, since Colab and GCS are both operated by Google.

    Note: There is also an option to mount the GCS bucket in your Colab runtime. I have not used this before.

    Update: A small note (also found in the link below): you will probably need to install some system packages on the Colab VM. Relevant link for GCP and colab
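A minimal sketch of the Subfolder Strategy (suggestion 1), assuming the flat images folder has been split once into subfolders named after the first two characters of each file name; the convention itself is arbitrary, the Dataset just has to reproduce it when building paths:

import PIL.Image
import torch
from torchvision import transforms

class ShardedDataset(torch.utils.data.Dataset):
    # Expects images laid out as, e.g., drive/MyDrive/images/ab/abc123.jpg
    # for an image id 'abc123.jpg' (hypothetical two-character sharding).

    def __init__(self, image_ids, labels, root='drive/MyDrive/images'):
        self.image_ids = image_ids
        self.labels = labels
        self.root = root

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, i):
        image_id = self.image_ids[i]
        shard = image_id[:2]  # same convention used to create the subfolders
        img = PIL.Image.open(f'{self.root}/{shard}/{image_id}')
        return transforms.ToTensor()(img), self.labels[i]

The one-time split of the flat folder into subfolders can be done from a Colab cell with a few lines of os.makedirs / shutil.move, so that no single Drive folder ends up holding all 30,000 files.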
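A minimal sketch of the Object Storage Strategy (suggestion 2) using the google-cloud-storage client; the bucket name, object prefix, and project id below are placeholders, and the runtime is assumed to be authorized first (e.g. via google.colab.auth.authenticate_user() in Colab):

import io

import PIL.Image
import torch
from google.cloud import storage  # pip install google-cloud-storage
from torchvision import transforms

class GCSDataset(torch.utils.data.Dataset):
    # Streams each image from a GCS bucket inside __getitem__.

    def __init__(self, image_ids, labels,
                 bucket_name='my-image-bucket',   # placeholder bucket name
                 prefix='images',                 # placeholder object prefix
                 project='my-gcp-project'):       # placeholder project id
        self.image_ids = image_ids
        self.labels = labels
        self.prefix = prefix
        # Reuse one client/bucket handle for all reads.
        self.bucket = storage.Client(project=project).bucket(bucket_name)

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, i):
        blob = self.bucket.blob(f'{self.prefix}/{self.image_ids[i]}')
        img = PIL.Image.open(io.BytesIO(blob.download_as_bytes()))
        return transforms.ToTensor()(img), self.labels[i]

The one-time upload of the folder can be done, for example, with gsutil -m cp -r images gs://my-image-bucket/ from a Colab cell.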