How can I map the tokens I get from the Hugging Face DistilBertTokenizer to their positions in the input text?
e.g. I have a new GPU
-> ["i", "have", "a", "new", "gp", "##u"]
-> [(0, 1), (2, 6), ...]
I'm interested in this because, if I have attention values assigned to each token, I would like to show which part of the original text each value actually corresponds to. The tokenized version is not friendly to people outside ML.
I have not found a solution to this yet, especially for cases where there is an [UNK] token. Any insights would be appreciated. Thank you!
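In newer versions of Transformers (it seems since 2.8), the tokenizer returns an object of class BatchEncoding when the methods __call__, encode_plus and batch_encode_plus are used. You can use its token_to_chars method, which takes a token index (optionally preceded by the index of the sequence in the batch) and returns the character span of that token in the original string. Note that this only works with the "fast" tokenizers, such as DistilBertTokenizerFast, because they are the ones that track character offsets.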