This is my code. The function works well for the training set, but for the test set it raises this error: RuntimeError: Token second\team not found and default index is not set
train_data, train_labels = text_classification._create_data_from_iterator(
    vocab, text_classification._csv_iterator(train_csv_path, ngrams, yield_cls=True), False)
test_data, test_labels = text_classification._create_data_from_iterator(
    vocab, text_classification._csv_iterator(test_csv_path, ngrams, yield_cls=True), False)
Does anyone know what is wrong?
The vocabulary acts as a lookup table for your data, translating str to int. When a given string (in this case "second\team") doesn't appear in the vocabulary, there are two strategies to compensate:
1. Raise an error, like the KeyError you get when calling {}[1] in Python.
2. Return a default value, like {}.get(1, "I don't know!") in Python.
Your code is currently doing #1. You seem to want #2, which you can achieve using vocab.set_default_index. When you build your vocab, add the specials=["<unk>"] kwarg and then call vocab.set_default_index(vocab['<unk>']).
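
To make the difference between the two strategies concrete, here is a minimal pure-Python sketch. TinyVocab is a hypothetical toy class written for this answer, not torchtext's implementation; it only imitates the lookup and set_default_index behaviour described above, and the tokens "first" and "team" are made up for illustration.

```python
class TinyVocab:
    """Toy stand-in for a vocabulary (hypothetical, not torchtext's class)."""

    def __init__(self, tokens, specials=()):
        # Specials are inserted first so they receive the lowest indices.
        self.stoi = {}
        for tok in list(specials) + list(tokens):
            self.stoi.setdefault(tok, len(self.stoi))
        self.default_index = None  # strategy 1 applies until a default is set

    def set_default_index(self, index):
        self.default_index = index

    def __getitem__(self, token):
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            # Strategy 1: fail loudly on an unknown token.
            raise RuntimeError(
                f"Token {token} not found and default index is not set")
        # Strategy 2: fall back to the default index.
        return self.default_index


# Vocabulary built from "training" tokens only, with an <unk> special.
vocab = TinyVocab(["first", "team"], specials=["<unk>"])

# Before a default index is set, an unseen token raises an error.
try:
    vocab["second\\team"]
    lookup_failed = False
except RuntimeError:
    lookup_failed = True

# After pointing the default at <unk>, the same lookup succeeds.
vocab.set_default_index(vocab["<unk>"])
unknown_index = vocab["second\\team"]
```

The training set never hits the error because the vocabulary was built from it; only the test set contains tokens the vocabulary has not seen, which is why the failure shows up there.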