
How to convert a tensor of token IDs back to the original text (before tokenization) in PyTorch

For example, the tensor below was produced by an English tokenizer:

tensor([[ 2992,  1852,  9439,  ...,  2610,  1704, 29189],
        [ 1852,  9439,     7,  ...,  1704, 29189, 23223],
        [ 9439,     7,  2367,  ..., 29189, 23223,   838],
        [   12,  7469, 28844,  ...,  2973,    16,    73],
        [ 7469, 28844, 28469,  ...,    16,    73,   735],
        [28844, 28469,   191,  ...,    73,   735,  4482]])

How can I convert it back to the original English text (using PyTorch)?


  • The method you're looking for is tokenizer.decode, which takes a sequence of token IDs and returns the corresponding text. In your case you have a batch of sentences (i.e. a sequence of sequences), so you'll need to apply the function to each row of your tensor:

    decoded = [tokenizer.decode(x) for x in xs]

    where tokenizer is your tokenizer and xs is the tensor you want to decode.

    This may also be useful:

    The tokenizer also provides the methods convert_ids_to_tokens, which does what the name suggests, and convert_tokens_to_string, which merges subword tokens back into words to recover the original input.
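
To make the steps above concrete, here is a minimal, self-contained sketch. It does not use a real tokenizer: ToyTokenizer, its seven-entry vocabulary, and the "##" continuation marker are all assumptions made up for illustration, and they only mimic the three method names mentioned in the answer. Nested Python lists stand in for the 2-D tensor; with PyTorch, xs.tolist() gives the same structure.

```python
# Hypothetical stand-in for a real tokenizer. Only the method names
# (decode, convert_ids_to_tokens, convert_tokens_to_string) follow the
# answer above; the implementation and vocabulary are invented.
class ToyTokenizer:
    def __init__(self, vocab):
        # Map each integer ID to its token string.
        self.id_to_token = dict(enumerate(vocab))

    def convert_ids_to_tokens(self, ids):
        # Look up every ID in the vocabulary.
        return [self.id_to_token[i] for i in ids]

    def convert_tokens_to_string(self, tokens):
        # Glue continuation pieces (marked here with "##") onto the
        # previous word, then join the words with spaces.
        words = []
        for tok in tokens:
            if tok.startswith("##") and words:
                words[-1] += tok[2:]
            else:
                words.append(tok)
        return " ".join(words)

    def decode(self, ids):
        # decode = IDs -> tokens -> string.
        return self.convert_tokens_to_string(self.convert_ids_to_tokens(ids))


tokenizer = ToyTokenizer(["the", "cat", "is", "sleep", "##ing", "dog", "runs"])

# A batch of two ID sequences (assumed data, standing in for a 2-D tensor).
xs = [[0, 1, 2, 3, 4],
      [0, 5, 2, 3, 4]]

# Decode row by row, exactly as in the list comprehension above.
decoded = [tokenizer.decode(x) for x in xs]

# The two intermediate steps, shown for the first row only.
tokens = tokenizer.convert_ids_to_tokens(xs[0])
text = tokenizer.convert_tokens_to_string(tokens)

print(decoded)
print(tokens)
print(text)
```

Note that decoding can only recover what the tokenizer kept: if it lowercases or otherwise normalizes the input, the decoded string will be the normalized text rather than a character-for-character copy of the original.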