TypeError: unsupported operand type(s) for +: 'generator' and 'generator'
for the next line:
vocab = build_vocab(train_data + dev_data)
I suspect the return value should be AllennlpDataset but maybe I got it mixed up. What did I do wrong?
Full code:
train_path = <some_path>
test_path = <some_other_path>
class ClassificationJobReader(DatasetReader):
def __init__(self,
lazy: bool = False,
tokenizer: Tokenizer = None,
token_indexers: Dict[str, TokenIndexer] = None,
max_tokens: int = None):
super().__init__(lazy)
self.tokenizer = tokenizer or WhitespaceTokenizer()
self.token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}
self.max_tokens = max_tokens
def _read(self, file_path: str) -> Iterable[Instance]:
df = pd.read_parquet(data_path)
for idx in df.index:
text = row['title'][idx] + ' ' + row['description'][idx]
print(f'text : {text}')
label = row['class_id'][idx]
print(f'label : {label}')
tokens = self.tokenizer.tokenize(text)
if self.max_tokens:
tokens = tokens[:self.max_tokens]
text_field = TextField(tokens, self.token_indexers)
label_field = LabelField(label)
fields = {'text': text_field, 'label': label_field}
yield Instance(fields)
def build_dataset_reader() -> DatasetReader:
return ClassificationJobReader()
def read_data(reader: DatasetReader) -> Tuple[Iterable[Instance], Iterable[Instance]]:
print("Reading data")
training_data = reader.read(train_path)
validation_data = reader.read(test_path)
return training_data, validation_data
def build_vocab(instances: Iterable[Instance]) -> Vocabulary:
print("Building the vocabulary")
return Vocabulary.from_instances(instances)
dataset_reader = build_dataset_reader()
train_data, dev_data = read_data(dataset_reader)
vocab = build_vocab(train_data + dev_data)
Thanks for your help
Please find below the code fix first and the explanation afterwards.
Code Fix
# the extend_from_instances expands your vocabulary with the instances passed as an arg
# and is therefore equivalent to Vocabulary.from_instances(train_data + dev_data)
# previously
vocabulary.extend_from_instances(train_data)
vocabulary.extend_from_instances(dev_data)
Explanation
This is because the AllenNLP API have had couple of breaking changes in allennlp==2.0.1. You can find the changelog here and the upgrade guide here.The guide is outdated as per my understanding (it reflects allennlp<=1.4).
The DatasetReader returns a generator now as opposed to a List previously. DatasetReader used to have a parameter called "lazy" which was for lazy loading data. It was False by default and therefore dataset_reader.read
would return a List previously. However, as of v2.0 (if i remember exactly), lazy loading is applied by default and it therefore returns a generator by default. As you know, the "+" operator has not been overridden for generator objects and therefore you cannot simply add two generators.
So, you can simply use vocab.extend_from_instances
to achieve same behavior as before. Hope this helped you. If you need a full code snippet, please leave a comment below, I could post a rekated gist and share it with you.
Good day!