
BertTokenizer - when encoding and decoding sequences extra spaces appear


When using Transformers from HuggingFace I am facing a problem with the encode and decode methods.

I have the following string:

test_string = 'text with percentage%'

Then I am running the following code:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

test_string = 'text with percentage%'

# encode() converts a string into a sequence of ids (integers), using the tokenizer and vocabulary
input_ids = tokenizer.encode(test_string)
output = tokenizer.decode(input_ids)

And the output looks like this:

'text with percentage %'

With an extra space before the %. I have tried extra arguments like clean_up_tokenization_spaces, but that addresses something different.
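Inspecting the tokens shows where the space comes from: BERT's basic tokenizer splits the % into its own token (here 'percentage' happens to be a single wordpiece), and decode() joins tokens back together with single spaces:

print(tokenizer.tokenize(test_string))
# ['text', 'with', 'percentage', '%']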

What should I use when encoding and decoding to get exactly the same text before and after? This also happens with other special characters.


Solution

  • If you are trying to use BERT for token classification in order to find a span in your original string, then one workaround is to use BertTokenizerFast with the option return_offsets_mapping=True.

    from transformers import BertTokenizerFast
    
    test_string = 'text with percentage%'
    
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
    tokens = tokenizer(test_string, return_offsets_mapping=True)
    input_ids = tokens.data["input_ids"]
    
    # Placeholder: your token classification model, returning the start
    # and stop token indices of the predicted span
    span_start_index, span_stop_index = some_model(input_ids)
    

    Then, once you have the token classification results, you can map the predicted token span back to a character span in the original string:

    offsets = tokens.encodings[0].offsets
    predicted_span = test_string[offsets[span_start_index][0]:offsets[span_stop_index][1]]
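
    As a self-contained sketch (with hard-coded span indices standing in for a real model's output, and assuming 'percentage' is a single wordpiece in the bert-base-cased vocabulary), the offsets recover the substring exactly, with no extra space:

    from transformers import BertTokenizerFast
    
    test_string = 'text with percentage%'
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
    tokens = tokenizer(test_string, return_offsets_mapping=True)
    
    # Token indices 3 and 4 correspond to 'percentage' and '%'
    # ([CLS] occupies index 0); they stand in for model predictions
    span_start_index, span_stop_index = 3, 4
    
    offsets = tokens.encodings[0].offsets
    predicted_span = test_string[offsets[span_start_index][0]:offsets[span_stop_index][1]]
    print(predicted_span)  # prints: percentage%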