Tags: python, pytorch, pytorch-dataloader

The shuffling order of DataLoader in pytorch


I am really confused about the shuffling order of DataLoader in PyTorch. Suppose I have a dataset:

datasets = [0,1,2,3,4]

In scenario I, the code is:

import torch
from torch.utils.data import DataLoader, RandomSampler

torch.manual_seed(1)

G = torch.Generator()
G.manual_seed(1)

ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)

The shuffling result is 0, 4, 2, 3, 1.


In scenario II, the code is:

torch.manual_seed(1)

G = torch.Generator()
G.manual_seed(1)

ran_sampler = RandomSampler(data_source=datasets)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)

The shuffling result is 1, 3, 4, 0, 2.


In scenario III, the code is:

torch.manual_seed(1)

G = torch.Generator()
G.manual_seed(1)

ran_sampler = RandomSampler(data_source=datasets, generator=G)
dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)

The shuffling result is 4, 1, 3, 0, 2.

Can someone explain what is going on here?


Solution

  • Based on your code, I made a small modification (to scenario II) and inspected the generators:

    import torch
    from torch.utils.data import DataLoader, RandomSampler

    datasets = [0,1,2,3,4]

    torch.manual_seed(1)
    G = torch.Generator()
    G.manual_seed(1)
    
    ran_sampler = RandomSampler(data_source=datasets, generator=G)
    dataloader = DataLoader(dataset=datasets, sampler=ran_sampler)
    print(id(dataloader.generator)==id(dataloader.sampler.generator))
    xs = []
    for x in dataloader:
        xs.append(x.item())
    print(xs)
    
    torch.manual_seed(1)
    G = torch.Generator()
    G.manual_seed(1)
    
    # this is different from OP's scenario II because in that case the ran_sampler is not initialized with the right generator.
    dataloader = DataLoader(dataset=datasets, shuffle=True, generator=G)
    print(id(dataloader.generator)==id(dataloader.sampler.generator))
    xs = []
    for x in dataloader:
        xs.append(x.item())
    print(xs)
    
    torch.manual_seed(1)
    G = torch.Generator()
    G.manual_seed(1)
    
    
    ran_sampler = RandomSampler(data_source=datasets, generator=G)
    dataloader = DataLoader(dataset=datasets, sampler=ran_sampler, generator=G)
    print(id(dataloader.generator)==id(dataloader.sampler.generator))
    xs = []
    for x in dataloader:
        xs.append(x.item())
    print(xs)
    

    The outputs are:

    False
    [0, 4, 2, 3, 1]
    True
    [4, 1, 3, 0, 2]
    True
    [4, 1, 3, 0, 2]
    

    The reason these three seemingly equivalent setups lead to different outcomes is that two different generators can be in play inside the DataLoader: the sampler's generator and the DataLoader's own self.generator. In the first snippet the latter is None.

    To make this clear, let's analyze the source. The generator not only drives the random number generation of the _index_sampler inside DataLoader but also affects the initialization of _BaseDataLoaderIter. See the source code:

            if sampler is None:  # give default samplers
                if self._dataset_kind == _DatasetKind.Iterable:
                    # See NOTE [ Custom Samplers and IterableDataset ]
                    sampler = _InfiniteConstantSampler()
                else:  # map-style
                    if shuffle:
                        sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
                    else:
                        sampler = SequentialSampler(dataset)  # type: ignore[arg-type]
    

    and

            self.sampler = sampler
            self.batch_sampler = batch_sampler
            self.generator = generator
    

    and

        def _get_iterator(self) -> '_BaseDataLoaderIter':
            if self.num_workers == 0:
                return _SingleProcessDataLoaderIter(self)
            else:
                self.check_worker_number_rationality()
                return _MultiProcessingDataLoaderIter(self)
    

    and

    class _BaseDataLoaderIter(object):
        def __init__(self, loader: DataLoader) -> None:
            ...
            self._index_sampler = loader._index_sampler
    
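The effect on _BaseDataLoaderIter can be observed directly: constructing the iterator draws a base seed from loader.generator, which advances that generator's state before the sampler produces a single index. A minimal check of this behavior (a sketch, assuming a recent torch version where DataLoader accepts a plain Python list):

```python
import torch
from torch.utils.data import DataLoader

data = [0, 1, 2, 3, 4]
g = torch.Generator()
g.manual_seed(1)

loader = DataLoader(data, shuffle=True, generator=g)

state_before = g.get_state()
it = iter(loader)  # _BaseDataLoaderIter.__init__ draws a base seed from loader.generator
state_after = g.get_state()

# the generator state changed even though no batch has been fetched yet
print(torch.equal(state_before, state_after))
```

This extra draw is exactly what makes a loader whose self.generator is G shuffle differently from one whose self.generator is None, even when both samplers were seeded identically.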
    • Scenario II & Scenario III

    Both setups are equivalent. In the second snippet we pass a generator to DataLoader and let it build the sampler: with shuffle=True, DataLoader automatically creates a RandomSampler with that generator and assigns the same generator to self.generator. In the third snippet we construct the RandomSampler ourselves with the same generator object that we also pass to DataLoader, so again the sampler and self.generator share one generator and the index sequences are identical.
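This equivalence can be checked end to end. The following sketch compares shuffle=True against an explicitly constructed RandomSampler that shares its generator object with the DataLoader:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler

data = [0, 1, 2, 3, 4]

# shuffle=True: DataLoader builds RandomSampler(dataset, generator=g1) itself
g1 = torch.Generator()
g1.manual_seed(1)
dl_a = DataLoader(data, shuffle=True, generator=g1)

# explicit sampler sharing the same generator object with the DataLoader
g2 = torch.Generator()
g2.manual_seed(1)
dl_b = DataLoader(data, sampler=RandomSampler(data, generator=g2), generator=g2)

order_a = [x.item() for x in dl_a]
order_b = [x.item() for x in dl_b]
print(order_a == order_b)  # both draw every random number from one seeded generator
```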

    • Scenario I

    We pass a sampler built with the right generator to DataLoader, but we do not pass the generator keyword argument to DataLoader.__init__(...). DataLoader keeps the supplied sampler yet leaves self.generator at its default, None, and that None generator is what the _BaseDataLoaderIter object returned by self._get_iterator() uses. Its base seed is therefore drawn from the global RNG rather than from G, so G's state is not advanced before sampling and the resulting order differs from the other two snippets.
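A quick sketch confirming this split: when only the sampler receives the generator, DataLoader.generator stays None while sampler.generator is the object we passed in:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler

data = [0, 1, 2, 3, 4]
g = torch.Generator()
g.manual_seed(1)

loader = DataLoader(data, sampler=RandomSampler(data, generator=g))

print(loader.sampler.generator is g)  # the sampler kept the generator we gave it
print(loader.generator is None)       # DataLoader's own generator was never set
```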