pytorch, pytorch-datapipe

Getting only negative samples from PyTorch IMDb dataset


I am trying to visualize several PyTorch datasets. For the IMDb dataset I am getting only negative training samples. In the original dataset the positive and the negative samples are balanced.

This is the code I am using. It is based on the T5 tutorial.

from torch.utils.data import DataLoader
from functools import partial
from torchtext.datasets import IMDB

imdb_datapipe = IMDB(split='test')

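# Map the raw labels ("1", "2" after str()) to readable names.
labels = {"1": "negative", "2": "positive"}
def process_labels(labels, x):
    # x is a (label, text) pair; return it as (text, readable label).
    return x[1], labels[str(x[0])]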


imdb_datapipe = imdb_datapipe.map(partial(process_labels, labels))
imdb_datapipe = imdb_datapipe.batch(2)
imdb_datapipe = imdb_datapipe.shuffle()
imdb_datapipe = imdb_datapipe.rows2columnar(["text", "label"])
imdb_dataloader = DataLoader(imdb_datapipe, batch_size=None)

it = iter(imdb_dataloader)

for _ in range(10):
    sample = next(it)
    for text, label in zip(sample['text'], sample['label']):
        print(f"{label}: {text[:100]}")

What am I missing?


Solution

  • I ran your code in a clean Colab environment and it works: I get both positive and negative examples.

    This may be an environment issue. Try reinstalling torchtext and running your code again; torchtext==0.15.2 with torch==2.0.1 works for me.
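
To check whether the environment really is the cause, a minimal sketch like the one below can report the installed versions and count the labels coming out of the raw datapipe. The helper names (`installed_versions`, `label_counts`, `check_imdb`) are made up for illustration and are not part of torchtext. The sketch also assumes that raw IMDB samples are `(label, text)` pairs, as the question's `process_labels` implies.

```python
from collections import Counter
from importlib import metadata
from itertools import islice


def installed_versions(packages=("torch", "torchtext", "torchdata")):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


def label_counts(samples, limit=None):
    """Count labels in an iterable of (label, text) pairs.

    If `limit` is given, only the first `limit` samples are inspected.
    """
    return Counter(label for label, _ in islice(samples, limit))


def check_imdb(split="test", limit=1000):
    """Hypothetical helper: compare the first `limit` labels with the full split.

    Assumption: IMDB(split=...) yields (label, text) pairs, as in the question.
    Calling this downloads the dataset if it is not cached yet.
    """
    from torchtext.datasets import IMDB  # imported lazily; needs torchtext

    head = label_counts(IMDB(split=split), limit=limit)
    full = label_counts(IMDB(split=split))
    return head, full
```

Calling `installed_versions()` shows whether the setup matches the versions reported to work above. If `check_imdb()` returns a `head` count with a single label while `full` is balanced, the raw stream is ordered by label, so what you see in the first few batches depends on whether shuffling is actually active in your environment.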