python nlp tokenize word-embedding huggingface-tokenizers

Extracting embedding values of NLP pretrained models from tokenized strings

I am using a Hugging Face pipeline to extract embeddings of the words in a sentence. As far as I know, the sentence is first turned into a sequence of tokens, and I think the number of tokens might not equal the number of words in the original sentence. I need to retrieve the embedding of a particular word in the sentence.

For example, here is my code:

#https://discuss.huggingface.co/t/extracting-token-embeddings-from-pretrained-language-models/6834/6

from transformers import pipeline, AutoTokenizer, AutoModel
import numpy as np
import re

model_name = "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

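# Pass the resized model object (not just its name) so that the pipeline's
# embedding matrix matches the tokenizer's vocabulary size
model_pipeline = pipeline('feature-extraction', model=model, tokenizer=tokenizer)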

def find_wordNo_sentence(word, sentence):
    """Return the 0-based position of `word` in the space-split sentence, or None."""
    print(sentence)
    splitted_sen = sentence.split(" ")
    print(splitted_sen)

    for i, w in enumerate(splitted_sen):
        if word == w:
            return i

    print("not found")
    return None


def return_xlnet_embedding(word, sentence):
    # Replace punctuation with spaces and collapse repeated whitespace
    word = re.sub(r'[^\w]', " ", word)
    word = " ".join(word.split())

    sentence = re.sub(r'[^\w]', ' ', sentence)
    sentence = " ".join(sentence.split())

    id_word = find_wordNo_sentence(word, sentence)

    try:
        data = model_pipeline(sentence)

        n_words = len(sentence.split(" "))
        n_embs = len(data[0])  # one vector per token, special tokens included
        print(n_embs, n_words)
        print(len(data[0]))

        # If n_words != n_embs, there are extra tokens (subwords and/or special
        # tokens), so id_word may not point at the embedding of `word`.
        results = data[0][id_word]
        return np.array(results)

    except (IndexError, TypeError):
        return "word not found"

return_xlnet_embedding('your', "what is your name?")

Then the output is:

what is your name
['what', 'is', 'your', 'name']
6 4
6

So the number of tokens fed to the pipeline is two more than the number of words in my sentence. How can I find which of these 6 vectors is the embedding of my word?

Solution

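As you may know, the vocabulary of a Hugging Face tokenizer contains frequent subwords as well as complete words. So if you want to extract the embedding of a word, keep in mind that the word may be split into several tokens and therefore correspond to more than one vector. In addition, Hugging Face pipelines encode the input sentence as a first step, and encoding adds special tokens to it. Where they go depends on the model: BERT-style tokenizers put one at the beginning and one at the end of the sentence, while for XLNet both (`<sep>` and `<cls>`) are appended at the end, which would account for the two extra vectors in your output.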

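# `pipe` is a pipeline instance (not the `pipeline` factory function).
# Assumption: the output below matches a BERT-style uncased tokenizer such as "bert-base-uncased".
pipe = pipeline('feature-extraction', model='bert-base-uncased')
string = 'This is a test for clarification'
print(pipe.tokenizer.tokenize(string))
print(pipe.tokenizer.encode(string))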

output:

['this', 'is', 'a', 'test', 'for', 'cl', '##ari', '##fication']

[101, 2023, 2003, 1037, 3231, 2005, 18856, 8486, 10803, 102]
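
Given a tokenization like the one above, one common way to get a single vector per word is to average the vectors of its subword tokens and skip the special tokens. The following is a minimal sketch of that idea, not the only option (taking just the first subword is another). The helper names `pool_word_embeddings` and `embed_words` are hypothetical, the toy vectors are made up, and `embed_words` assumes a model with a fast tokenizer, since it relies on `word_ids()` to map each token back to its word.

```python
import numpy as np


def pool_word_embeddings(token_embeddings, word_ids):
    """Average the token vectors that belong to the same word.

    token_embeddings: array-like of shape (n_tokens, hidden_size).
    word_ids: one entry per token; the index of the word the token came from,
        or None for special tokens such as [CLS] and [SEP].
    Returns a dict {word_index: vector of shape (hidden_size,)}.
    """
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    if token_embeddings.shape[0] != len(word_ids):
        raise ValueError("need exactly one word id per token vector")

    groups = {}
    for token_idx, word_idx in enumerate(word_ids):
        if word_idx is None:  # special token: not part of any word
            continue
        groups.setdefault(word_idx, []).append(token_idx)

    return {w: token_embeddings[idxs].mean(axis=0) for w, idxs in groups.items()}


def embed_words(sentence, model_name="bert-base-uncased"):
    """Hypothetical wrapper: one pooled vector per word of `sentence`.

    Assumes `model_name` ships a fast tokenizer. Imports are local so the
    pooling helper above can be used without transformers/torch installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0].numpy()
    # word_ids(0): token-to-word mapping for the first (only) sequence
    return pool_word_embeddings(hidden, enc.word_ids(0))


# Toy data mirroring the tokenization above:
# [CLS] this is a test for cl ##ari ##fication [SEP]
toy_word_ids = [None, 0, 1, 2, 3, 4, 5, 5, 5, None]
toy_vectors = np.arange(20).reshape(10, 2)  # 10 tokens, hidden size 2 (made up)
toy_result = pool_word_embeddings(toy_vectors, toy_word_ids)
print(toy_result[5])  # mean of rows 6, 7 and 8, i.e. the pieces of "clarification"
```

Note that the word indices follow the tokenizer's own pre-tokenization, which may treat punctuation as separate words, so they will not always line up with `sentence.split(" ")`.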