python pytorch tensor huggingface-transformers

How to convert a tensor of token IDs back to the original text (before tokenization) in PyTorch

For example, the tensor below holds token IDs produced by an English tokenizer.

tensor([[ 2992,  1852,  9439,  ...,  2610,  1704, 29189],
        [ 1852,  9439,     7,  ...,  1704, 29189, 23223],
        [ 9439,     7,  2367,  ..., 29189, 23223,   838],
        ...,
        [   12,  7469, 28844,  ...,  2973,    16,    73],
        [ 7469, 28844, 28469,  ...,    16,    73,   735],
        [28844, 28469,   191,  ...,    73,   735,  4482]])

How can I transform it back into the original English text (using PyTorch)?

Solution

The method you're looking for is tokenizer.decode, which maps a sequence of token IDs back to text. In your case you have a batch of sequences (i.e. a sequence of sequences), so you'll need to apply the method to each row of your tensor, i.e.

decoded = [tokenizer.decode(x) for x in xs]

where tokenizer is your tokenizer object and xs is the tensor you want to decode.
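
A minimal sketch of that loop, for illustration only. To keep it runnable without downloading a model, ToyTokenizer and its five-word vocabulary are hypothetical stand-ins for a real Hugging Face tokenizer, and a nested list stands in for the 2-D tensor; only the decode call and the list comprehension mirror the line above.

```python
# Hypothetical stand-in for a Hugging Face tokenizer: only decode() is mimicked.
class ToyTokenizer:
    def __init__(self, vocab):
        # vocab maps token id -> token string (made-up vocabulary)
        self.vocab = vocab

    def decode(self, ids):
        # int(i) normalizes each id before the dictionary lookup
        return " ".join(self.vocab[int(i)] for i in ids)


tokenizer = ToyTokenizer({0: "the", 1: "cat", 2: "sat", 3: "on", 4: "mat"})

# Stand-in for the tensor of token ids: one row per sequence.
xs = [[0, 1, 2], [1, 2, 3], [3, 0, 4]]

# Decode row by row, exactly as in the one-liner above.
decoded = [tokenizer.decode(x) for x in xs]
print(decoded)
```

With a real Hugging Face tokenizer, batch_decode does the same thing for all rows in one call, and both decode and batch_decode accept skip_special_tokens=True if you want padding or separator tokens dropped from the result.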

Possibly also useful:

The tokenizer also provides convert_ids_to_tokens, which does what the name suggests, and convert_tokens_to_string, which merges subword tokens back into words to recover the input text.
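
A rough sketch of how these two steps fit together. ToySubwordTokenizer is a hypothetical stand-in, its vocabulary is made up, and the "##" continuation prefix is an assumption borrowed from WordPiece-style tokenizers; other tokenizers mark subwords differently, so treat the merging rule here as illustrative rather than as the library's behavior for your model.

```python
# Hypothetical stand-in exposing the two helper methods named above.
class ToySubwordTokenizer:
    def __init__(self, vocab):
        # vocab maps token id -> token string (made-up vocabulary)
        self.vocab = vocab

    def convert_ids_to_tokens(self, ids):
        # Look up the token string for each id, keeping subword markers.
        return [self.vocab[int(i)] for i in ids]

    def convert_tokens_to_string(self, tokens):
        # Assumed WordPiece-style rule: "##" marks a continuation of the
        # previous token, so join on spaces and then glue those pieces back.
        return " ".join(tokens).replace(" ##", "").strip()


tokenizer = ToySubwordTokenizer({0: "token", 1: "##ization", 2: "is", 3: "fun"})

ids = [0, 1, 2, 3]
tokens = tokenizer.convert_ids_to_tokens(ids)    # subword pieces
text = tokenizer.convert_tokens_to_string(tokens)  # pieces merged into words
print(tokens)
print(text)
```

Inspecting the intermediate tokens this way can help when decode returns text that looks slightly different from what you expected, since it shows where the subword boundaries fall.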