Tags: nlp, huggingface-transformers, bert-language-model, transformer-model

How does BERT word embedding preprocessing work?


I'm trying to figure out what BERT's preprocessing does, i.e., how it is actually done, but I can't find a good explanation. If somebody knows one, I would appreciate a link to a better, more in-depth explanation. If, on the other hand, someone wants to explain it here, I would also be extremely thankful!

My question is: how does BERT mathematically convert a string input into a fixed-size vector of numbers? What are the logical steps it follows?


Solution

  • BERT provides its own tokenizer. Because BERT is a pretrained model that expects input data in a specific format, the following are required:

    • A special token, [SEP], to mark the end of a sentence, or the separation between two sentences
    • A special token, [CLS], at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is.
    • Tokens that conform with the fixed vocabulary used in BERT
    • The Token IDs for the tokens, from BERT’s tokenizer
    • Mask IDs to indicate which elements in the sequence are tokens and which are padding elements
    • Segment IDs used to distinguish different sentences
    • Positional Embeddings used to show token position within the sequence


    from transformers import BertTokenizer
    
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    
    # An example sentence 
    text = "Sentence to embed"
    
    # Add the special tokens.
    marked_text = "[CLS] " + text + " [SEP]"
    
    # Split the sentence into tokens.
    tokenized_text = tokenizer.tokenize(marked_text)
    
    # Map the token strings to their vocabulary indices.
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text) 
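
    As a complement, the same tokenizer can also produce the mask IDs and segment IDs from the list above in a single call, together with padding to a fixed length. The snippet below is a minimal sketch of that (the max_length value of 16 and the second sentence are just illustrative choices):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Two example sentences; the tokenizer inserts [CLS] and [SEP] itself here.
    encoded = tokenizer("Sentence to embed",
                        "A second sentence",
                        padding='max_length',   # pad up to max_length
                        max_length=16,          # illustrative fixed length
                        truncation=True,
                        return_tensors='pt')

    print(encoded['input_ids'])       # token IDs (vocabulary indices)
    print(encoded['token_type_ids'])  # segment IDs: 0 for the first sentence, 1 for the second
    print(encoded['attention_mask'])  # mask IDs: 1 for real tokens, 0 for padding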
    

    Have a look at this excellent tutorial for more details.
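
    To connect this back to the question of how the string ends up as a fixed-size vector: the token IDs are fed into the BERT model itself, which adds the segment and positional embeddings internally and returns one 768-dimensional vector per token (for bert-base). A common way to obtain a single fixed-size sentence vector is to average the token vectors or to take the [CLS] vector. The sketch below assumes PyTorch and a recent transformers version, and continues from the indexed_tokens variable above:

    import torch
    from transformers import BertModel

    # Load the pre-trained model weights and switch to evaluation mode.
    model = BertModel.from_pretrained('bert-base-uncased')
    model.eval()

    # A single sentence, so all segment IDs are 0.
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensor = torch.zeros_like(tokens_tensor)

    with torch.no_grad():
        outputs = model(tokens_tensor, token_type_ids=segments_tensor)

    # last_hidden_state has shape [batch, num_tokens, 768]: one vector per token.
    token_vectors = outputs.last_hidden_state

    # One common fixed-size sentence representation: the mean over the token vectors.
    sentence_vector = token_vectors.mean(dim=1)   # shape [1, 768]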