python · deep-learning · nlp · pytorch · text-classification

PyTorch DataLoader for sentences


I have collected a small dataset for binary text classification, and my goal is to train a model with the method proposed in Convolutional Neural Networks for Sentence Classification (Kim, 2014).

I started my implementation by using torch.utils.data.Dataset. Essentially, every sample in my dataset my_data looks like this (for example):

{"words":[0,1,2,3,4],"label":1},
{"words":[4,9,20,30,4,2,3,4,1],"label":0}

Next, I took a look at Writing custom dataloaders with PyTorch, using:

from torch.utils.data import DataLoader

dataloader = DataLoader(my_data, batch_size=2,
                        shuffle=False, num_workers=4)

I would suspect that enumerating over a batch would yield something like the following:

{"words":[[0,1,2,3,4],[4,9,20,30,4,2,3,4,1]],"labels":[1,0]}

However, it is more like this:

{"words":[[0,4],[1,9],[2,20],[3,30],[4,4]],"label":[1,0]}

I guess it has something to do with the fact that the samples are not of equal size. Do they need to be the same size, and if so, how can I achieve that? For people who know this paper, what does your training data look like?
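For reference, the behaviour can be reproduced with just a DataLoader and a plain list of samples (the list below stands in for my real dataset):

from torch.utils.data import DataLoader

# A plain Python list works as a map-style dataset; it stands in for my real data.
my_data = [
    {"words": [0, 1, 2, 3, 4], "label": 1},
    {"words": [4, 9, 20, 30, 4, 2, 3, 4, 1], "label": 0},
]

dataloader = DataLoader(my_data, batch_size=2, shuffle=False)

for batch in dataloader:
    # The default collate_fn zips the "words" lists position by position,
    # which gives the transposed structure shown above (newer PyTorch versions
    # may instead raise an error about the lists not being of equal size).
    print(batch)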

Edit: here is my Dataset implementation:

import json
from string import punctuation

import torch
from torch.utils.data import Dataset
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assuming NLTK's English stopword list; adjust to whatever collection you use.
stopwords = set(stopwords.words("english"))


class CustomDataset(Dataset):
    def __init__(self, path_to_file, max_size=10, transform=None):
        with open(path_to_file) as f:
            self.data = json.load(f)
        self.transform = transform
        self.vocab = self.build_vocab(self.data)
        self.word2idx, self.idx2word = self.word2index(self.vocab)

    def get_vocab(self):
        return self.vocab

    def get_word2idx(self):
        return self.word2idx, self.idx2word

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        inputs_ = word_tokenize(self.data[idx][0])
        inputs_ = [w for w in inputs_ if w not in stopwords]
        inputs_ = [w for w in inputs_ if w not in punctuation]
        inputs_ = [self.word2idx[w] for w in inputs_]  # convert words to indices

        label = {"positive": 1, "negative": 0}
        label_ = label[self.data[idx][1]]  # convert label to 0|1

        sample = {"words": inputs_, "label": label_}

        if self.transform:
            sample = self.transform(sample)

        return sample

    def build_vocab(self, corpus):
        word_count = {}
        for sentence in corpus:
            tokens = word_tokenize(sentence[0])
            for token in tokens:
                if token not in word_count:
                    word_count[token] = 1
                else:
                    word_count[token] += 1
        return word_count

    def word2index(self, word_count):
        word_index = {w: i for i, w in enumerate(word_count)}
        idx_word = {i: w for i, w in enumerate(word_count)}
        return word_index, idx_word

Solution

  • As you correctly suspected, this is mostly a problem of different tensor shapes. Luckily, PyTorch offers you several solutions of varying simplicity to achieve what you desire (batch sizes >= 1 for text samples):

    • The highest-level solution is probably torchtext, which provides several solutions out of the box to load (custom) datasets for NLP tasks. If you can make your training data fit into any one of the described loaders, this is probably the recommended option, as there is decent documentation and there are several examples (a minimal sketch follows this list).
    • If you prefer to build a solution yourself, there are padding utilities like torch.nn.utils.rnn.pad_sequence, possibly in combination with torch.nn.utils.rnn.pack_padded_sequence, or the combination of both (torch.nn.utils.rnn.pack_sequence). This generally allows you a lot more flexibility, which may or may not be something that you require.
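    To make the torchtext option concrete, here is a minimal sketch assuming the legacy torchtext API (Field / BucketIterator); newer torchtext releases have deprecated and removed these classes, and the raw sentences below are hypothetical stand-ins for your own data:

from torchtext.data import Field, LabelField, Example, Dataset, BucketIterator

# Field/LabelField handle tokenisation, vocabulary building and padding for you.
TEXT = Field(sequential=True, tokenize=str.split, lower=True, batch_first=True)
LABEL = LabelField()

fields = [("words", TEXT), ("label", LABEL)]

# Hypothetical raw samples; in practice these would come from your JSON file.
raw = [("a short positive sentence", "positive"),
       ("a somewhat longer negative example sentence", "negative")]
examples = [Example.fromlist([text, lab], fields) for text, lab in raw]
dataset = Dataset(examples, fields)

TEXT.build_vocab(dataset)   # builds the word-to-index mapping
LABEL.build_vocab(dataset)  # maps the label strings to integer indices

iterator = BucketIterator(dataset, batch_size=2, repeat=False,
                          sort_key=lambda ex: len(ex.words),
                          sort_within_batch=True)

for batch in iterator:
    print(batch.words.shape, batch.label)  # words are padded per batch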

    Personally, I have had good experiences using just pad_sequence, sacrificing a bit of speed for a much clearer debugging state, and others seem to have similar recommendations.
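    For the pad_sequence route, a minimal sketch of a padding collate_fn could look like the following (the names pad_collate and lengths are mine, not anything PyTorch prescribes; it assumes samples shaped like the {"words": [...], "label": ...} dicts from the question):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # Turn each variable-length list of word indices into a 1-D tensor.
    words = [torch.tensor(sample["words"], dtype=torch.long) for sample in batch]
    labels = torch.tensor([sample["label"] for sample in batch], dtype=torch.long)
    lengths = torch.tensor([len(w) for w in words], dtype=torch.long)
    # pad_sequence right-pads every sequence to the longest one in the batch.
    # Note: padding_value=0 assumes index 0 is reserved for padding; the
    # word2index in the question starts real words at 0, so you may want to
    # shift indices by one (or add an explicit <pad> token) first.
    padded = pad_sequence(words, batch_first=True, padding_value=0)
    return {"words": padded, "lengths": lengths, "label": labels}

# my_data is the CustomDataset (or plain list of samples) from the question.
dataloader = DataLoader(my_data, batch_size=2, shuffle=False,
                        num_workers=4, collate_fn=pad_collate)

    With batch_first=True, batch["words"] comes out as a (batch_size, max_len) tensor, which is what an nn.Embedding layer followed by the convolutional filters from the Kim paper expects; if you later switch to a recurrent encoder, the lengths tensor can be handed to torch.nn.utils.rnn.pack_padded_sequence (with enforce_sorted=False) to skip the padding.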