I have two parallel datasets, dataset1 and dataset2, and the following is my code to load them in parallel using SubsetRandomSampler, where I provide train_indices for data loading.
P.S. Even after setting num_workers=0 and seeding both np and torch, the two DataLoaders do not yield corresponding samples (i.e. the same shuffled index from both datasets). Any suggestions are heartily welcome, including methods other than SubsetRandomSampler.
import torch, numpy as np
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler
dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
train_indices = list(range(len(dataset1)))
torch.manual_seed(12)
np.random.seed(12)
np.random.shuffle(train_indices)
sampler = SubsetRandomSampler(train_indices)
dataloader1 = DataLoader(dataset1, batch_size=2, num_workers=0, sampler=sampler)
dataloader2 = DataLoader(dataset2, batch_size=2, num_workers=0, sampler=sampler)
for i, (data1, data2) in enumerate(zip(dataloader1, dataloader2)):
    x = data1
    y = data2
    print(x, y)
Output:
tensor([5, 1]) tensor([15, 18])
tensor([0, 2]) tensor([14, 12])
tensor([4, 6]) tensor([16, 10])
tensor([8, 9]) tensor([11, 19])
tensor([7, 3]) tensor([17, 13])
Expected Output:
tensor([5, 1]) tensor([15, 11])
tensor([0, 2]) tensor([10, 12])
tensor([4, 6]) tensor([14, 16])
tensor([8, 9]) tensor([18, 19])
tensor([7, 3]) tensor([17, 13])
Since I was using a random sampler, random indices are expected: SubsetRandomSampler draws a fresh permutation every time a DataLoader iterates over it, so the two DataLoaders shuffle independently. To yield the same (shuffled) indices from both DataLoaders, it is better to create the shuffled index list once and then use a custom sampler that simply iterates over it:
class MySampler(torch.utils.data.sampler.Sampler):
    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        # Always yield the indices in the same, pre-shuffled order
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)
dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
train_indices = list(range(len(dataset1)))
np.random.seed(12)
np.random.shuffle(train_indices)
sampler = MySampler(train_indices)
dataloader1 = DataLoader(dataset1, batch_size=2, num_workers=0, sampler=sampler)
dataloader2 = DataLoader(dataset2, batch_size=2, num_workers=0, sampler=sampler)
for i, (data1, data2) in enumerate(zip(dataloader1, dataloader2)):
    x = data1
    y = data2
    print(x, y)
P.S. I got the solution by cross-posting on the PyTorch forums, but I am keeping it here for future readers. Credit to ptrblck.
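As an aside (not part of the original answer): if the two datasets always have the same length and indexing, another option is to wrap them in a single TensorDataset so that one DataLoader shuffles both tensors with the same permutation. A minimal sketch, assuming the same toy tensors as above:
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

# One dataset that returns (dataset1[i], dataset2[i]) pairs,
# so a single DataLoader shuffles both tensors together.
paired = TensorDataset(dataset1, dataset2)

torch.manual_seed(12)
dataloader = DataLoader(paired, batch_size=2, shuffle=True, num_workers=0)

for x, y in dataloader:
    print(x, y)  # x and y always come from the same original indices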