
Understanding how GPT-2 tokenizes strings


Using the tutorials here, I wrote the following code:

from transformers import GPT2Tokenizer, GPT2Model
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state

So I understand that "inputs" consists of the tokenized items of my sentence. But how can I get the values of the tokenized items (for example, ["Hello", ",", "my", "dog", "is", "cute"])?

I am asking because I think the tokenizer sometimes splits a word when that word is not in its vocabulary (e.g., a word from another language), and I want to check for that in my code.


Solution

  • You can call tokenizer.decode on the output of the tokenizer to map each token ID back to the corresponding vocabulary entry:

    >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    >>> list(map(tokenizer.decode, inputs.input_ids[0]))
    ['Hello', ',', ' my', ' dog', ' is', ' cute']
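    If you want to see the raw subword pieces themselves (rather than the decoded strings), you can also use tokenizer.convert_ids_to_tokens or tokenizer.tokenize. A minimal sketch, assuming the transformers library is installed; "Schadenfreude" is just an illustrative rare word, not from the original question:

    ```python
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    # Raw BPE tokens for the input: a leading 'Ġ' marks a token
    # that begins with a space in the original text.
    inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
    print(tokens)  # ['Hello', ',', 'Ġmy', 'Ġdog', 'Ġis', 'Ġcute']

    # A word not in the vocabulary gets split into several subword pieces,
    # which is exactly the behavior you can check for this way.
    rare = tokenizer.tokenize("Schadenfreude")
    print(rare)
    ```

    If a word appears as a single element in this list, it is in the vocabulary; if it appears as multiple pieces, the BPE tokenizer split it.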