Tags: python, pytorch, tensor, pytorch-dataloader

Problem loading parallel datasets even after using SubsetRandomSampler


I have two parallel datasets, dataset1 and dataset2, and the following is my code to load them in parallel using SubsetRandomSampler, where I provide train_indices for data loading.

P.S. Even after setting num_workers=0 and seeding both NumPy and torch, the samples do not come out aligned: the two loaders yield different shuffles. Any suggestions are heartily welcome, including methods other than SubsetRandomSampler.

import torch, numpy as np
from torch.utils.data import DataLoader, SubsetRandomSampler

dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

train_indices = list(range(len(dataset1)))
torch.manual_seed(12)
np.random.seed(12)
np.random.shuffle(train_indices)
sampler = SubsetRandomSampler(train_indices)

# Both loaders share the same sampler object
dataloader1 = DataLoader(dataset1, batch_size=2, num_workers=0, sampler=sampler)
dataloader2 = DataLoader(dataset2, batch_size=2, num_workers=0, sampler=sampler)

for data1, data2 in zip(dataloader1, dataloader2):
    print(data1, data2)

Output:

tensor([5, 1]) tensor([15, 18])
tensor([0, 2]) tensor([14, 12])
tensor([4, 6]) tensor([16, 10])
tensor([8, 9]) tensor([11, 19])
tensor([7, 3]) tensor([17, 13])

Expected Output:

tensor([5, 1]) tensor([15, 11])
tensor([0, 2]) tensor([10, 12])
tensor([4, 6]) tensor([14, 16])
tensor([8, 9]) tensor([18, 19])
tensor([7, 3]) tensor([17, 13])

Solution

  • Since SubsetRandomSampler draws a fresh permutation each time it is iterated, the two DataLoaders see different orders even though they share a single sampler object, so the mismatched indices are expected (see the sketch after the code). To yield the same shuffled indices from both DataLoaders, create the shuffled indices once up front, and then use a custom sampler that simply replays them:

    class MySampler(torch.utils.data.sampler.Sampler):
        """Deterministic sampler that replays a fixed list of indices."""
        def __init__(self, indices):
            self.indices = indices

        def __iter__(self):
            # Unlike SubsetRandomSampler, this yields the same order on every pass
            return iter(self.indices)

        def __len__(self):
            return len(self.indices)
    
    
    dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
    
    train_indices = list(range(len(dataset1)))
    np.random.seed(12)
    np.random.shuffle(train_indices)  # shuffle once; the sampler replays this fixed order
    
    sampler = MySampler(train_indices)
    
    dataloader1 = DataLoader(dataset1, batch_size=2, num_workers=0, sampler=sampler)
    dataloader2 = DataLoader(dataset2, batch_size=2, num_workers=0, sampler=sampler)
    
    for data1, data2 in zip(dataloader1, dataloader2):
        print(data1, data2)
    
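    Why the original code failed: SubsetRandomSampler draws a new random permutation of its indices every time it is iterated, so each DataLoader triggers a fresh shuffle even when both share a single sampler object. A minimal sketch of that behaviour (outputs are illustrative and will vary):

    from torch.utils.data import SubsetRandomSampler

    sampler = SubsetRandomSampler(list(range(10)))
    print(list(sampler))  # one permutation, e.g. [3, 7, 0, 9, 2, 5, 8, 1, 6, 4]
    print(list(sampler))  # a different permutation on the next iteration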

    P.S. I got the solution by cross-posting on the PyTorch forums, but I am keeping it here for future readers. Credits to ptrblck.
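
    Another option, since both datasets here are equal-length, index-aligned tensors: combine them with torch.utils.data.TensorDataset so that a single DataLoader shuffles both together and the pairing can never drift apart. A minimal sketch under that assumption:

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

    # One dataset yields aligned (x, y) pairs, so one loader shuffles both together
    paired = TensorDataset(dataset1, dataset2)
    loader = DataLoader(paired, batch_size=2, shuffle=True)

    for x, y in loader:
        print(x, y)  # y always equals x + 10, batch by batch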