Tags: tokenize, huggingface-transformers, huggingface-tokenizers

Mapping huggingface tokens to original input text


How can I map the tokens I get from huggingface DistilBertTokenizer to the positions of the input text?

e.g. I have a new GPU -> ["i", "have", "a", "new", "gp", "##u"] -> [(0, 1), (2, 6), ...]

I'm interested in this because, for example, if I have attention values assigned to each token, I would like to show which part of the original text each token actually corresponds to, since the tokenized version is not friendly to non-ML people.

I have not found any solution to this yet, especially for the case where there is an [UNK] token. Any insights would be appreciated. Thank you!


Solution

  • In newer versions of Transformers (since v2.8, it seems), the tokenizer returns an object of class BatchEncoding from its __call__, encode_plus, and batch_encode_plus methods. You can use its token_to_chars method, which takes the index of a token in the batch and returns its character span in the original string.
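
    Note that token_to_chars is only available for fast (Rust-backed) tokenizers. Where that is not an option, a rough manual alignment can be sketched in plain Python. The helper below is hypothetical (not part of Transformers) and assumes a lowercasing WordPiece-style tokenizer whose ## prefix marks subword continuations and whose [UNK] token covers exactly one whitespace-delimited word:

    ```python
    def align_tokens(text, tokens, unk_token="[UNK]"):
        """Map WordPiece-style tokens back to (start, end) character spans in text.

        Hypothetical helper: assumes the tokenizer lowercases its input and
        that each unk_token stands for one whitespace-delimited word.
        """
        spans = []
        cursor = 0
        lowered = text.lower()
        for tok in tokens:
            if tok == unk_token:
                # Skip whitespace, then consume one whole word for [UNK].
                while cursor < len(text) and text[cursor].isspace():
                    cursor += 1
                start = cursor
                while cursor < len(text) and not text[cursor].isspace():
                    cursor += 1
                spans.append((start, cursor))
            else:
                # Strip the subword-continuation prefix before searching.
                piece = tok[2:] if tok.startswith("##") else tok
                start = lowered.find(piece, cursor)
                if start == -1:
                    raise ValueError(f"Could not align token {tok!r}")
                cursor = start + len(piece)
                spans.append((start, cursor))
        return spans

    print(align_tokens("I have a new GPU", ["i", "have", "a", "new", "gp", "##u"]))
    # [(0, 1), (2, 6), (7, 8), (9, 12), (13, 15), (15, 16)]
    ```

    This greedy left-to-right search can misplace a token when the same substring occurs earlier in the text than where the tokenizer actually cut it, so treat it as a fallback; with a fast tokenizer, token_to_chars (or passing return_offsets_mapping=True when calling the tokenizer) gives exact spans.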