Tags: nlp, huggingface-transformers, bert-language-model, transformer-model

How does BERT word embedding preprocessing work?


I'm trying to figure out what BERT's preprocessing does, i.e., how it is actually done, but I can't find a good explanation. If somebody knows one, I would appreciate a link to a better, more in-depth explanation. If, on the other hand, someone wants to explain it here, I would also be extremely thankful!

My question is: how does BERT mathematically convert a string input into a fixed-size vector of numbers? What are the logical steps it follows?


Solution

  • BERT provides its own tokenizer. Because BERT is a pretrained model that expects input data in a specific format, the following are required:

    • A special token, [SEP], to mark the end of a sentence, or the separation between two sentences
    • A special token, [CLS], at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is.
    • Tokens that conform with the fixed vocabulary used in BERT
    • The Token IDs for the tokens, from BERT’s tokenizer
    • Mask IDs to indicate which elements in the sequence are tokens and which are padding elements
    • Segment IDs used to distinguish different sentences
    • Positional Embeddings used to show token position within the sequence


    from transformers import BertTokenizer
    
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    
    # An example sentence 
    text = "Sentence to embed"
    
    # Add the special tokens.
    marked_text = "[CLS] " + text + " [SEP]"
    
    # Split the sentence into tokens.
    tokenized_text = tokenizer.tokenize(marked_text)
    
    # Map the token strings to their vocabulary indices.
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text) 
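
    As a complement, the same tokenizer can also produce the mask IDs and segment IDs from the list above in a single call, together with padding to a fixed length. The snippet below is a minimal sketch of that (the max_length value of 16 and the second sentence are just illustrative choices):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Two example sentences; the tokenizer inserts [CLS] and [SEP] itself here.
    encoded = tokenizer("Sentence to embed",
                        "A second sentence",
                        padding='max_length',   # pad up to max_length
                        max_length=16,          # illustrative fixed length
                        truncation=True,
                        return_tensors='pt')

    print(encoded['input_ids'])       # token IDs (vocabulary indices)
    print(encoded['token_type_ids'])  # segment IDs: 0 for the first sentence, 1 for the second
    print(encoded['attention_mask'])  # mask IDs: 1 for real tokens, 0 for padding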
    

    Have a look at this excellent tutorial for more details.
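
    To connect this back to the question of how the string ends up as a fixed-size vector: the token IDs are fed into the BERT model itself, which adds the segment and positional embeddings internally and returns one 768-dimensional vector per token (for bert-base). A common way to obtain a single fixed-size sentence vector is to average the token vectors or to take the [CLS] vector. The sketch below assumes PyTorch and a recent transformers version, and continues from the indexed_tokens variable above:

    import torch
    from transformers import BertModel

    # Load the pre-trained model weights and switch to evaluation mode.
    model = BertModel.from_pretrained('bert-base-uncased')
    model.eval()

    # A single sentence, so all segment IDs are 0.
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensor = torch.zeros_like(tokens_tensor)

    with torch.no_grad():
        outputs = model(tokens_tensor, token_type_ids=segments_tensor)

    # last_hidden_state has shape [batch, num_tokens, 768]: one vector per token.
    token_vectors = outputs.last_hidden_state

    # One common fixed-size sentence representation: the mean over the token vectors.
    sentence_vector = token_vectors.mean(dim=1)   # shape [1, 768]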