I would like to use BERT for both tokenization and indexing in a seq2seq model. This is what my config file looks like so far:
```json
{
  "dataset_reader": {
    "type": "seq2seq",
    "end_symbol": "[SEP]",
    "quoting": 3,
    "source_token_indexers": {
      "tokens": {
        "type": "pretrained_transformer",
        "model_name": "bert-base-german-cased"
      }
    },
    "source_tokenizer": {
      "type": "pretrained_transformer",
      "model_name": "bert-base-german-cased"
    },
    "start_symbol": "[CLS]",
    "target_token_indexers": {
      "tokens": {
        "namespace": "tokens"
      }
    },
    "target_tokenizer": {
      "type": "pretrained_transformer",
      "add_special_tokens": true,
      "model_name": "bert-base-german-cased"
    }
  },
```
Later, when I load the model and use predictor.predict_json to predict sentences, the output looks like this:
```
'predicted_tokens': ['[CLS]', 'Die', 'meisten', 'Universitäts', '##abs', '##ch', '##lüsse', 'sind', 'nicht', 'p', '##raxis', '##orient', '##iert', 'und', 'bereit', '##en', 'die', 'Studenten', 'nicht', 'auf', 'die', 'wirklich', '##e', 'Welt', 'vor', '.', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]']
```
I have 2 questions:

1. How can I keep the special symbols like [CLS] and the repeated [SEP] out of the predicted tokens?
2. How can I join the subword tokens (e.g. 'Universitäts', '##abs', '##ch', '##lüsse') back into normal words?

Thanks
To answer your questions:

1. Set add_special_tokens = False in your target tokenizer's configuration, so that [CLS] and [SEP] are not added to the target sequences.
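For example, assuming the rest of the config stays as above, the target_tokenizer entry would become:

```json
"target_tokenizer": {
  "type": "pretrained_transformer",
  "add_special_tokens": false,
  "model_name": "bert-base-german-cased"
}
```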
2. To turn the subword tokens back into a normal string, call .tokenizer.convert_tokens_to_string (which takes the list of subword tokens as input), where tokenizer refers to the tokenizer used by your DatasetReader.

Please let us know if you have further questions!