
AllenNLP DatasetReader.read returns generator instead of AllennlpDataset



While studying AllenNLP framework (version 2.0.1), I tried to implement the example code from https://guide.allennlp.org/training-and-prediction#1.
While reading the data from a Parquet file I got:
TypeError: unsupported operand type(s) for +: 'generator' and 'generator'

for the next line:

vocab = build_vocab(train_data + dev_data)

I suspect the return value should be AllennlpDataset but maybe I got it mixed up. What did I do wrong?

Full code:

train_path = <some_path>
test_path = <some_other_path>

class ClassificationJobReader(DatasetReader):
    def __init__(self,
                 lazy: bool = False,
                 tokenizer: Tokenizer = None,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 max_tokens: int = None):
        super().__init__(lazy)
        self.tokenizer = tokenizer or WhitespaceTokenizer()
        self.token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}
        self.max_tokens = max_tokens

    def _read(self, file_path: str) -> Iterable[Instance]:
        df = pd.read_parquet(file_path)
        for idx in df.index:
            text = df['title'][idx] + ' ' + df['description'][idx]
            print(f'text : {text}')
            label = df['class_id'][idx]
            print(f'label : {label}')
            tokens = self.tokenizer.tokenize(text)
            if self.max_tokens:
                tokens = tokens[:self.max_tokens]
            text_field = TextField(tokens, self.token_indexers)
            label_field = LabelField(label)
            fields = {'text': text_field, 'label': label_field}
            yield Instance(fields)

def build_dataset_reader() -> DatasetReader:
    return ClassificationJobReader()

def read_data(reader: DatasetReader) -> Tuple[Iterable[Instance], Iterable[Instance]]:
    print("Reading data")
    training_data = reader.read(train_path)
    validation_data = reader.read(test_path)
    return training_data, validation_data

def build_vocab(instances: Iterable[Instance]) -> Vocabulary:
    print("Building the vocabulary")
    return Vocabulary.from_instances(instances)

dataset_reader = build_dataset_reader()
train_data, dev_data = read_data(dataset_reader)
vocab = build_vocab(train_data + dev_data)

Thanks for your help


Solution

  • The code fix is below, followed by the explanation.


    Code Fix

    # extend_from_instances expands the vocabulary with the instances passed
    # as an argument, so the two calls below are equivalent to the previous
    # Vocabulary.from_instances(train_data + dev_data)
    vocab = Vocabulary()
    vocab.extend_from_instances(train_data)
    vocab.extend_from_instances(dev_data)
    

    Explanation

    This is because the AllenNLP API had a couple of breaking changes in allennlp==2.0.1. You can find the changelog here and the upgrade guide here. As far as I can tell, the guide is outdated (it reflects allennlp<=1.4).

    DatasetReader.read now returns a generator, whereas it previously returned a List. DatasetReader used to have a parameter called "lazy" for lazy data loading; it defaulted to False, so dataset_reader.read returned a List. As of v2.0 (if I remember correctly), lazy loading is the default, so read returns a generator instead. The "+" operator is not defined for generator objects, which is why you cannot simply add two generators.
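    You can reproduce the failure with plain Python generators. If you would rather keep a single build_vocab(...) call, itertools.chain from the standard library concatenates generators lazily; note this is an alternative workaround to the AllenNLP-specific fix, and the variable names below are just illustrative:

    ```python
    from itertools import chain

    # Two throwaway generators standing in for train_data and dev_data
    train = (x for x in [1, 2])
    dev = (x for x in [3])

    # "+" is not defined for generators, which triggers:
    # TypeError: unsupported operand type(s) for +: 'generator' and 'generator'
    try:
        combined = train + dev
    except TypeError as e:
        print(e)

    # chain() yields from each iterable in turn without materializing them,
    # so Vocabulary.from_instances(chain(train_data, dev_data)) would also work
    train = (x for x in [1, 2])
    dev = (x for x in [3])
    print(list(chain(train, dev)))  # [1, 2, 3]
    ```
    
    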

    So you can simply use vocab.extend_from_instances to get the same behavior as before. Hope this helped. If you need a full code snippet, please leave a comment below and I can post a related gist and share it with you.

    Good day!