Tags: nlp, huggingface-transformers, bert-language-model, word-embedding

The inputs into BERT are token IDs. How do I get the corresponding input token VECTORs into BERT?


I am new and learning about transformers.

In a lot of BERT tutorials, I see that the input is just the token IDs of the words. But surely we need to convert these token IDs to a vector representation (it could be a one-hot encoding, or any initial vector representation for each token ID) so that the model can use them.

My question is: where can I find this initial vector representation for each token?
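
To make the question concrete: with the huggingface-transformers library from the tags, the kind of lookup table I mean would be something like the following (a minimal sketch, assuming the standard bert-base-uncased checkpoint; get_input_embeddings is my guess at where that table lives):

    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    # Token IDs for a sample sentence
    input_ids = tokenizer('Hello world', return_tensors='pt')['input_ids']

    # The learned embedding matrix that maps each token ID to its initial vector
    embedding_layer = model.get_input_embeddings()   # an nn.Embedding of shape (vocab_size, 768)
    token_vectors = embedding_layer(input_ids)        # shape (1, seq_len, 768)
    print(token_vectors.shape)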


Solution

  • In BERT, the input is the raw string itself. BERT then tokenizes it and turns each token into a vector. Let's see an example:

    import tensorflow_hub as hub
    import tensorflow_text  # needed so the ops used by the preprocessing model are registered

    prep_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
    enc_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4'

    bert_preprocess = hub.KerasLayer(prep_url)
    bert_encoder = hub.KerasLayer(enc_url)

    text = ["Hello, I'm new to Stack Overflow"]

    # First, you need to preprocess the data
    preprocessed_text = bert_preprocess(text)
    # this gives you a dict with keys such as 'input_word_ids' (the token IDs),
    # 'input_mask' and 'input_type_ids'

    encoded = bert_encoder(preprocessed_text)
    # encoded['pooled_output'] is a (1, 768) vector representing the context of the whole input text

    # you can explore both dicts by printing their keys()
    

    I recommend going to both links above and doing a little research. To recap, BERT takes strings as input and then tokenizes them (with its own tokenizer!). If you want to tokenize with the same values, you need the same vocab file, but for a fresh start like yours this should be enough. A short follow-up showing how to pull per-token vectors out of the same output is sketched below.
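
    If what you are after is one vector per token rather than one vector for the whole sentence, the same encoder output also exposes that; a minimal sketch building on the code above (the key names are the ones documented on the TF Hub model page):

    # Per-token contextual vectors live under 'sequence_output'
    token_vectors = encoded['sequence_output']         # shape (1, 128, 768): one 768-d vector per token position
    token_ids = preprocessed_text['input_word_ids']    # the token IDs those vectors correspond to
    print(token_ids.shape, token_vectors.shape)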