I am trying to use a DataLoader for training. The dataset is 150 GB of .npz files. Due to memory limits, only one sample is read from disk at a time. The following is part of the code.
```python
class VimeoDataset(Dataset):
    def __init__(self, mode, batch_size=32, num_workers=8, num_gpus=4):
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.num_gpus = num_gpus
        self.mode = mode
        self.load_data()
        self.h = 256
        self.w = 448
        xx = np.arange(0, self.w).reshape(1, -1).repeat(self.h, 0)
        yy = np.arange(0, self.h).reshape(-1, 1).repeat(self.w, 1)
        self.grid = np.stack((xx, yy), 2).copy()
        self.npzs = []
        count = self.batch_size * self.num_workers * self.num_gpus
        if self.mode == 'train':
            filelist = glob('/data/vimeoFlow2/dataset/train/*.npz')
            self.npzs = [filelist[i:i + count] for i in range(0, len(filelist), count)]
        else:
            filelist = glob('/data/vimeoFlow2/dataset/val/*.npz')
            self.npzs = [filelist[i:i + count] for i in range(0, len(filelist), count)]

    def __len__(self):
        return len(self.npzs)

    def load_data(self, index):
        self.data = []
        self.flow_data = []
        for i in range(len(self.npzs[index])):
            f = np.load(self.npzs[index][i])
            self.data.append(f['i0i1gt'])
            if self.mode == 'train':
                self.flow_data.append(f['ft0ft1'])
            else:
                self.flow_data.append(np.zeros((256, 448, 4)))

    def getimg(self, index):
        data = self.meta_data[index]
        img0 = data[0:3].transpose(1, 2, 0)
        img1 = data[3:6].transpose(1, 2, 0)
        gt = data[6:9].transpose(1, 2, 0)
        flow_gt = (self.flow_data[index]).transpose(1, 2, 0)
        return img0, gt, img1, flow_gt

    def __getitem__(self, index):
        img0, gt, img1, flow_gt = self.getimg(index)

dataset = VimeoDataset(mode='train', batch_size=32, num_workers=8, num_gpus=4)
sampler = DistributedSampler(dataset)
train_data = DataLoader(dataset, batch_size=args.batch_size, pin_memory=True,
                        num_workers=args.num_workers, drop_last=True, sampler=sampler)
dataset_val = VimeoDataset(mode='val', batch_size=32, num_workers=8, num_gpus=4)
val_data = DataLoader(dataset_val, batch_size=args.batch_size, pin_memory=True,
                      num_workers=args.num_workers)
```
However, reading the data from disk one by one makes the dataloader very time-consuming. So I want to improve this program: first load num_gpus × num_workers × batch_size samples into memory, then read the data from memory with `__getitem__`, and finally replace the data in memory after each iteration. But I still don't know how to achieve it. I have tried my idea in the code above; I don't know how to pass the `index` parameter to the `load_data` function.
It looks like you are using the torch `Dataset` in the wrong way. Your `Dataset` subclass should neither batch the data itself nor use the number of workers. Batching the data and loading it in parallel is the role of the `DataLoader` class. Your `Dataset` subclass's `__getitem__` method should return only one sample (and, additionally, one ground-truth annotation) from the dataset, as data such as a `Tensor` or an `Array` that can be concatenated to create a batch. Take a look at the `Dataset` and `DataLoader` documentation, which is pretty clear on this.
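For your case, a minimal sketch of such a `Dataset` could look like the following. It keeps the paths and `.npz` keys from your question, but adds a `root` parameter (an assumption, for flexibility) and indexes individual files rather than chunks of files:

```python
import numpy as np
from glob import glob
from torch.utils.data import Dataset


class VimeoDataset(Dataset):
    """One .npz file per index; batching and workers are the DataLoader's job."""

    def __init__(self, mode, root='/data/vimeoFlow2/dataset'):
        self.mode = mode
        # One entry per sample file, not per chunk of batch_size * workers files.
        self.files = sorted(glob('%s/%s/*.npz' % (root, mode)))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        # Read exactly one sample from disk; workers parallelize these calls.
        f = np.load(self.files[index])
        data = f['i0i1gt']
        img0 = data[0:3].transpose(1, 2, 0)
        img1 = data[3:6].transpose(1, 2, 0)
        gt = data[6:9].transpose(1, 2, 0)
        if self.mode == 'train':
            flow_gt = f['ft0ft1'].transpose(1, 2, 0)
        else:
            flow_gt = np.zeros((256, 448, 4), dtype=np.float32)
        return img0, gt, img1, flow_gt
```

Each returned array is a single sample, so the default collate function can stack them into a batch.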
The purpose of `DataLoader` is to load (i.e. read from disk into memory) and pre-process your data in parallel. If you specify 8 workers, it roughly means that 8 parallel worker processes are calling the `__getitem__` method to create a batch of items. Note that `DataLoader` already "caches" the data and loads it in advance to be ready in time (take a look at the `prefetch_factor` parameter).
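A self-contained sketch of this division of labour, using a toy `TensorDataset` in place of your `.npz` dataset (sizes are illustrative, not prescriptive):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A toy dataset stands in for the .npz dataset from the question.
dataset = TensorDataset(torch.arange(100).reshape(100, 1).float())

loader = DataLoader(
    dataset,
    batch_size=32,      # batching is the DataLoader's job, not the Dataset's
    num_workers=2,      # 2 worker processes call __getitem__ in parallel
    prefetch_factor=2,  # each worker keeps 2 batches loaded ahead of time
    pin_memory=True,
    drop_last=True,
)

for (batch,) in loader:
    print(batch.shape)  # torch.Size([32, 1]) -- samples stacked into a batch
```

Note that `prefetch_factor` only applies when `num_workers > 0`.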
This should be a sufficient compromise between loading speed and memory consumption; try it before writing any custom caching, loading, or parallel processing of your data.