When using Transformers from HuggingFace, I am facing a problem with the encode and decode methods.
I have the following string:
test_string = 'text with percentage%'
Then I am running the following code:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
test_string = 'text with percentage%'
# encode converts a string into a sequence of ids (integers), using the tokenizer and vocabulary
input_ids = tokenizer.encode(test_string)
output = tokenizer.decode(input_ids)
And the output looks like this:
'text with percentage %'
Note the extra space before the %. I have tried extra arguments such as clean_up_tokenization_spaces, but that option is for something different.
What should I use when encoding and decoding to get exactly the same text before and after? This also happens with other special characters.
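For reference, inspecting the tokens shows where the space comes from: the tokenizer splits the % off into its own token, and decode joins tokens with spaces. A quick check (the exact wordpieces may vary by vocabulary):
print(tokenizer.tokenize(test_string))
# something like: ['text', 'with', 'percentage', '%']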
If you are trying to use BERT for token classification in order to find a span in your original string, one workaround is to use BertTokenizerFast with the option return_offsets_mapping=True.
from transformers import BertTokenizerFast

test_string = 'text with percentage%'
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
# return_offsets_mapping=True requires a "fast" (Rust-backed) tokenizer
tokens = tokenizer(test_string, return_offsets_mapping=True)
input_ids = tokens["input_ids"]
# some_model is a placeholder for whatever token classification model you use
span_start_index, span_stop_index = some_model(input_ids)
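Each entry of the offset mapping is a (start, stop) character range into the original string, with special tokens such as [CLS] and [SEP] mapped to (0, 0). For the example string it should look roughly like this (exact values depend on how the words get split into wordpieces):
print(tokens["offset_mapping"])
# roughly: [(0, 0), (0, 4), (5, 9), (10, 20), (20, 21), (0, 0)]
# i.e.      [CLS]   'text'  'with'  'percentage'  '%'    [SEP]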
Then, once you have the token classification results, you can do something like this:
offsets = tokens.encodings[0].offsets
predicted_span = test_string[offsets[span_start_index][0]:offsets[span_stop_index][1]]
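Because the span is sliced directly out of test_string rather than reassembled from decoded tokens, it preserves the original text exactly. For instance, assuming the offsets shown above (with 'percentage' at index 3 and '%' at index 4), a prediction covering those two tokens gives:
print(test_string[offsets[3][0]:offsets[4][1]])  # 'percentage%', no extra space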