pytorch · tokenize · huggingface-transformers

Tokens-to-words mapping in the tokenizer decode step (Hugging Face)?


Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode() function?
For example:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

str = "This is a tokenization example"
tokenized = tokenizer.tokenize(str) 
## ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']

encoded = tokenizer.encode_plus(str) 
## encoded['input_ids']=[0, 42, 16, 10, 19233, 1938, 1246, 2]

decoded = tokenizer.decode(encoded['input_ids']) 
## '<s> this is a tokenization example</s>'

The objective is to have a function that maps each token in the decode step back to the correct input word; for this example it would be:
desired_output = [[1],[2],[3],[4,5],[6]]
since the word this corresponds to id 42, while token and ization correspond to ids [19233, 1938], which sit at indexes 4 and 5 of the input_ids array.


Solution

  • If you use the fast tokenizers, i.e. the Rust-backed versions from the tokenizers library, the encoding contains a word_ids method that can be used to map sub-words back to their original word. What constitutes a word vs. a subword depends on the tokenizer: a word is something produced by the pre-tokenization stage, i.e. split by whitespace, while a subword is generated by the actual model (BPE or Unigram, for example).

    The code below should work in general, even if the pre-tokenization performs additional splitting. For example, I created my own custom step that splits based on PascalCase; the words here are Pascal and Case. The accepted answer won't work in this case, since it assumes words are whitespace-delimited. (A variant that collects every token index per word is sketched after the code.)

    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained('roberta-large', do_lower_case=True)
    
    example = "This is a tokenization example"
    
    encoded = tokenizer(example)
    
    desired_output = []
    for word_id in encoded.word_ids():
        # word_ids() has one entry per token; it is None for special tokens
        # such as <s> and </s>
        if word_id is not None:
            # word_to_tokens returns the token span (start, end) of this word,
            # with end exclusive
            start, end = encoded.word_to_tokens(word_id)
            if start == end - 1:
                tokens = [start]
            else:
                tokens = [start, end-1]
            # consecutive tokens of the same word map to the same span,
            # so only append it once
            if len(desired_output) == 0 or desired_output[-1] != tokens:
                desired_output.append(tokens)
    desired_output
    ## [[1], [2], [3], [4, 5], [6]]
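
    If you also want every token index of a word, rather than just the first and last of its span, a small variant of the same idea is to group token positions by their word id. The sketch below assumes a fast tokenizer (so that word_ids() is available); the name word_to_token_indexes and the use of collections.defaultdict are my own choices, not part of the transformers API.

    from collections import defaultdict
    
    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained('roberta-large')
    
    encoded = tokenizer("This is a tokenization example")
    
    # word_ids() yields one entry per token: None for special tokens,
    # otherwise the index of the pre-tokenized word the token came from,
    # e.g. [None, 0, 1, 2, 3, 3, 4, None] for this sentence.
    word_to_token_indexes = defaultdict(list)
    for token_index, word_id in enumerate(encoded.word_ids()):
        if word_id is not None:
            word_to_token_indexes[word_id].append(token_index)
    
    list(word_to_token_indexes.values())
    ## [[1], [2], [3], [4, 5], [6]]

    For words that are split into three or more sub-word tokens this yields the full list of token indexes, whereas the loop above only records the first and last one.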