
AllenNLP - dataset_reader config for transformers


I would like to use BERT for both tokenization and indexing in a seq2seq model, and this is what my config file looks like so far:

{
"dataset_reader": {
    "type": "seq2seq",
    "end_symbol": "[SEP]",
    "quoting": 3,
    "source_token_indexers": {
        "tokens": {
            "type": "pretrained_transformer",
            "model_name": "bert-base-german-cased"
        }
    },
    "source_tokenizer": {
        "type": "pretrained_transformer",
        "model_name": "bert-base-german-cased"
    },
    "start_symbol": "[CLS]",
    "target_token_indexers": {
        "tokens": {
            "namespace": "tokens"
        }
    },
    "target_tokenizer": {
        "type": "pretrained_transformer",
        "add_special_tokens": true,
        "model_name": "bert-base-german-cased"
    }
    },
    ...
}

Later, when I load the model and use predictor.predict_json to predict sentences, the output looks like this:

'predicted_tokens': ['[CLS]', 'Die', 'meisten', 'Universitäts', '##abs', '##ch', '##lüsse', 'sind', 'nicht', 'p', '##raxis', '##orient', '##iert', 'und', 'bereit', '##en', 'die', 'Studenten', 'nicht', 'auf', 'die', 'wirklich', '##e', 'Welt', 'vor', '.', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]', '[SEP]']

I have 2 questions:

  1. Is this a normal output (considering all the '[SEP]' tokens at the end), or am I doing something wrong in the config file?
  2. Is there any function that would convert these tokens into a human-readable sentence?

Thanks


Solution

    1. Please set add_special_tokens = False on the target_tokenizer. Your dataset reader already adds the special tokens via start_symbol and end_symbol, so leaving it True duplicates them in the target sequence.
    2. Use tokenizer.convert_tokens_to_string, which takes the list of subword tokens as input; here tokenizer refers to the tokenizer used by your DatasetReader.
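With the first fix applied, the target_tokenizer section of the config above would read (JSON uses lowercase false):

```json
"target_tokenizer": {
    "type": "pretrained_transformer",
    "add_special_tokens": false,
    "model_name": "bert-base-german-cased"
}
```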

    Please let us know if you have further questions!
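To illustrate what convert_tokens_to_string does for a WordPiece vocabulary, here is a minimal, self-contained sketch (not the HuggingFace implementation itself): pieces that start with '##' are glued to the previous token, and special tokens like [CLS]/[SEP] are dropped. The function name and the special-token set are my own choices for this example.

```python
def wordpiece_tokens_to_string(tokens):
    """Simplified stand-in for tokenizer.convert_tokens_to_string:
    join WordPiece subword tokens back into a plain sentence."""
    special = {"[CLS]", "[SEP]", "[PAD]", "[MASK]"}  # assumed special tokens
    words = []
    for tok in tokens:
        if tok in special:
            continue  # drop special tokens entirely
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation piece: glue to previous word
        else:
            words.append(tok)
    return " ".join(words)


# The predicted_tokens from the question (trailing [SEP] repeats omitted):
predicted = ['[CLS]', 'Die', 'meisten', 'Universitäts', '##abs', '##ch',
             '##lüsse', 'sind', 'nicht', 'p', '##raxis', '##orient',
             '##iert', 'und', 'bereit', '##en', 'die', 'Studenten',
             'nicht', 'auf', 'die', 'wirklich', '##e', 'Welt', 'vor',
             '.', '[SEP]']
print(wordpiece_tokens_to_string(predicted))
# → Die meisten Universitätsabschlüsse sind nicht praxisorientiert und
#   bereiten die Studenten nicht auf die wirkliche Welt vor .
```

Note that punctuation stays space-separated ("vor ."); the real tokenizer method behaves similarly for plain WordPiece, so some light post-processing may still be needed for fully natural text.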