I need to load a time-series dataset to train a network. The dataset was split into many chunks `train_x_0.npy`, `train_x_1.npy`, ..., `train_x_40.npy` (41 chunks) because of memory issues when extracting these `.npy` files from the raw data. However, their total size (around 1000 GB) is so large that I can't load everything into RAM. I have been considering two ways to solve this problem.
1. Load the data chunks using `np.load()` with the argument `mmap_mode='r+'`. The memory-mapped chunks are stored in a Python list `self.data`. In the `__getitem__(self, idx)` method of the PyTorch `Dataset` class, I convert `idx` to `chunk_idx` and `sample_idx`, then get the sample via `self.data[chunk_idx][sample_idx]` (see the sketch after this list).
2. Extract the `.npy` files again from the raw data and save them sample-by-sample, i.e. one `.npy` file is now one sample, not a data chunk. In the `__getitem__(self, idx)` method, I get one sample by loading it with `np.load(sample_path)`.
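
For reference, a stripped-down sketch of what I have in mind for method 1 (the class name is made up; it assumes all chunks except possibly the last have the same length):

```python
import numpy as np
from torch.utils.data import Dataset

class ChunkedNpyDataset(Dataset):
    def __init__(self, chunk_paths):
        # Memory-map every chunk; only the .npy headers are read here.
        self.data = [np.load(p, mmap_mode='r+') for p in chunk_paths]
        self.chunk_len = self.data[0].shape[0]  # samples per (full) chunk

    def __len__(self):
        return sum(chunk.shape[0] for chunk in self.data)

    def __getitem__(self, idx):
        # Convert the flat index into (chunk_idx, sample_idx).
        chunk_idx, sample_idx = divmod(idx, self.chunk_len)
        # Reading this slice triggers the actual disk access.
        return self.data[chunk_idx][sample_idx]
```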
Assuming the PyTorch `DataLoader` will be used to iterate through all samples, which method would be faster?

If you have another suggestion for extracting the raw data or loading the `.npy` files, please share your opinion.
Both suggested approaches will be limited by your filesystem's IO, since each sample will be loaded from disk on demand (memory mapping does not speed up the actual loading once a given sample is requested).
Especially when you are planning to train for many epochs, you can achieve a strong speedup by loading your original chunks `train_x_0.npy`, `train_x_1.npy`, etc. one at a time (or as many as you can hold in RAM) and training multiple epochs on that chunk before switching to the next.
For this, you need control over the sample indices requested by the `DataLoader`. You can get it by defining a sampler that is passed only the sample indices available in the currently cached data chunk. In pseudocode, your training loop could look something like this when caching one chunk at a time:
```python
from yourproject import Dataset  # your custom Dataset implementation
from torch.utils.data import DataLoader
from torch.utils.data.sampler import SubsetRandomSampler

dataset = Dataset(train_data_path, ...)

for chunk_idx in range(num_chunks):
    # Load this chunk into RAM (replacing the previously cached one).
    dataset.cache_chunk(chunk_idx)

    # Restrict the sampler to the indices that live in the cached chunk.
    chunk_sample_inds = dataset.get_chunk_sample_inds(chunk_idx)
    chunk_sampler = SubsetRandomSampler(chunk_sample_inds)
    chunk_loader = DataLoader(dataset=dataset, sampler=chunk_sampler)

    # Train several epochs on this chunk before moving on to the next.
    for chunk_epoch in range(num_chunk_epoch):
        for sample_idx, sample in enumerate(chunk_loader):
            output = model(sample)
```
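
Note that with this scheme only the order of samples within a chunk is randomized; if full shuffling matters for your training, you can additionally visit the chunks themselves in a random order on each pass.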
Hereby, your `Dataset` class needs to take care of:

- loading a given chunk into RAM and caching it there (in the `cache_chunk` method)
- providing the global indices of the samples contained in a given chunk (in the `get_chunk_sample_inds` method)
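
For illustration, a minimal sketch of such a `Dataset` class (the constructor argument and everything beyond `cache_chunk` and `get_chunk_sample_inds` are assumptions; it also assumes the first axis of each chunk indexes samples):

```python
import numpy as np
from torch.utils.data import Dataset

class ChunkCachingDataset(Dataset):
    def __init__(self, chunk_paths):
        self.chunk_paths = chunk_paths
        # Read only the .npy headers (via mmap) to learn each chunk's length.
        self.chunk_lens = [np.load(p, mmap_mode='r').shape[0] for p in chunk_paths]
        # chunk_starts[i] is the global index of the first sample of chunk i.
        self.chunk_starts = np.concatenate([[0], np.cumsum(self.chunk_lens)])
        self.cached_chunk_idx = None
        self.cached_chunk = None

    def __len__(self):
        return int(self.chunk_starts[-1])

    def cache_chunk(self, chunk_idx):
        # Fully load this chunk into RAM, dropping the previous one.
        self.cached_chunk = np.load(self.chunk_paths[chunk_idx])
        self.cached_chunk_idx = chunk_idx

    def get_chunk_sample_inds(self, chunk_idx):
        start = int(self.chunk_starts[chunk_idx])
        return list(range(start, start + self.chunk_lens[chunk_idx]))

    def __getitem__(self, idx):
        # The sampler only requests indices belonging to the cached chunk.
        local_idx = int(idx) - int(self.chunk_starts[self.cached_chunk_idx])
        # Returning a NumPy array is fine: the default collate_fn converts it.
        return self.cached_chunk[local_idx]
```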
If you use a fast GPU (which is often limited by shuffling data back and forth between RAM and VRAM, even for RAM-cached data), you can expect a speedup of several orders of magnitude with this approach, as opposed to attempting to fill the VRAM from the HDD for each individual sample.