deep-learning pytorch numpy-memmap

Compare the efficiency of the data loading methods in deep learning


I need to load a time-series dataset to train a network. The dataset was split into many chunks train_x_0.npy, train_x_1.npy, ..., train_x_40.npy (41 chunks) because of memory issues when I extracted these .npy files from the raw data. However, the chunks are so large (around 1000 GB) that I cannot load everything into RAM. I have been considering two ways to solve this problem.

  1. Loading the data chunks using np.load() with the argument mmap_mode='r+'. The memory-mapped chunks are stored in a Python list self.data. In the __getitem__(self, idx) method of the PyTorch Dataset class, I convert idx to chunk_idx and sample_idx, then get the sample with self.data[chunk_idx][sample_idx].
  2. Extracting the .npy files again from the raw data and saving the data sample by sample, i.e. one .npy file is now one sample, not a data chunk. In the __getitem__(self, idx) method, I get one sample by loading it with np.load(sample_path).
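
The index conversion in option 1 could look roughly like the minimal sketch below. The class name MemmapChunks and the locate helper are hypothetical names chosen for illustration, and the sketch opens the chunks with mmap_mode="r" on the assumption that read-only access is enough for training. It uses only NumPy; a real implementation would put the same logic into a PyTorch Dataset subclass.

```python
import bisect

import numpy as np


class MemmapChunks:
    """Map a global sample index onto (chunk_idx, sample_idx) across chunks."""

    def __init__(self, chunk_paths):
        # mmap_mode="r" opens each chunk read-only without reading it into RAM.
        self.data = [np.load(p, mmap_mode="r") for p in chunk_paths]
        # offsets[i] is the global index of the first sample in chunk i;
        # offsets[-1] is the total number of samples.
        self.offsets = np.cumsum([0] + [len(c) for c in self.data]).tolist()

    def __len__(self):
        return self.offsets[-1]

    def locate(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Last chunk whose first global index is <= idx.
        chunk_idx = bisect.bisect_right(self.offsets, idx) - 1
        return chunk_idx, idx - self.offsets[chunk_idx]

    def __getitem__(self, idx):
        chunk_idx, sample_idx = self.locate(idx)
        # np.array(...) copies the sample out of the memory map into RAM.
        return np.array(self.data[chunk_idx][sample_idx])
```

Precomputing the offsets once keeps each lookup to a binary search over 41 entries, so the cost per sample is dominated by the disk read itself.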

Assuming the PyTorch DataLoader is used to iterate through all samples, which method will be faster?

If you have another suggestion to extract the raw data or to load the .npy files, please share your opinion.


Solution

  • Both suggested approaches will be limited by your filesystem's I/O, since each sample is loaded from disk on demand (memory mapping does not speed up the actual read once a given sample is requested).

    Especially if you plan to train for many epochs, you can achieve a substantial speedup by loading your original chunks train_x_0.npy, train_x_1.npy, etc. one at a time (or as many as fit in RAM) and training for multiple epochs on the cached chunk before switching to the next.

    For this, you need control over the sample indices requested by the dataloader. You can get it by defining a sampler that is given only the sample indices available in the currently cached chunk. In pseudocode, your training loop could look something like this when caching one chunk at a time:

    from torch.utils.data import DataLoader
    from torch.utils.data.sampler import SubsetRandomSampler

    from yourproject import Dataset  # your own Dataset subclass

    dataset = Dataset(train_data_path, ...)
    for chunk_idx in range(num_chunks):
        # Load this chunk into RAM and restrict sampling to its samples.
        dataset.cache_chunk(chunk_idx)
        chunk_sample_inds = dataset.get_chunk_sample_inds(chunk_idx)
        chunk_sampler = SubsetRandomSampler(chunk_sample_inds)
        chunk_loader = DataLoader(dataset=dataset, sampler=chunk_sampler)
        # Train several epochs on the cached chunk before loading the next one.
        for chunk_epoch in range(num_chunk_epoch):
            for sample in chunk_loader:
                output = model(sample)
    

    For this to work, your Dataset class needs to take care of:

    • caching (loading to RAM) a specified chunk, given a chunk idx (indicated by the cache_chunk method)
    • returning a list of valid sample indices for a given chunk idx (indicated by the get_chunk_sample_inds method)
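
A minimal sketch of these two responsibilities, under stated assumptions: the class name ChunkCachedDataset and its constructor argument chunk_paths are hypothetical, samples are indexed globally across all chunks, and only NumPy is used. The class exposes the map-style interface (__len__ and __getitem__); in a real project it would subclass torch.utils.data.Dataset.

```python
import numpy as np


class ChunkCachedDataset:
    """Map-style dataset that keeps one chunk in RAM at a time (hypothetical)."""

    def __init__(self, chunk_paths):
        self.chunk_paths = list(chunk_paths)
        # Memory-map each chunk just to read its length; no sample data is loaded.
        lengths = [len(np.load(p, mmap_mode="r")) for p in self.chunk_paths]
        # offsets[i] is the global index of the first sample in chunk i.
        self.offsets = np.cumsum([0] + lengths).tolist()
        self.cached_idx = None
        self.cached = None

    def __len__(self):
        return self.offsets[-1]

    def cache_chunk(self, chunk_idx):
        # Load the whole chunk into RAM, replacing the previously cached one.
        self.cached = np.load(self.chunk_paths[chunk_idx])
        self.cached_idx = chunk_idx

    def get_chunk_sample_inds(self, chunk_idx):
        # Global indices of all samples stored in the given chunk.
        return list(range(self.offsets[chunk_idx], self.offsets[chunk_idx + 1]))

    def __getitem__(self, idx):
        if self.cached_idx is None:
            raise RuntimeError("call cache_chunk() before requesting samples")
        start = self.offsets[self.cached_idx]
        stop = self.offsets[self.cached_idx + 1]
        if not start <= idx < stop:
            raise IndexError(f"sample {idx} is not in cached chunk {self.cached_idx}")
        return self.cached[idx - start]
```

The range check in __getitem__ is a deliberate guard: it fails loudly if a sampler built for one chunk is used while a different chunk is cached, instead of silently returning the wrong sample.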

    If you use a fast GPU (which is often limited by moving data back and forth between RAM and VRAM, even for RAM-cached data), you can expect a speedup of up to several orders of magnitude with this approach, compared with filling the VRAM from the HDD for each individual sample.