I'm trying to figure out what BERT preprocessing does, i.e., how it is actually done, but I can't find a good explanation. I would appreciate a link to a better, more in-depth explanation if somebody knows one.
If someone wants to explain it here instead, I would also be extremely thankful!
My question is: how does BERT mathematically convert a string input into a fixed-size vector of numbers? What are the logical steps it follows?
BERT provides its own tokenizer. Because BERT is a pretrained model that expects input data in a specific format, the following special tokens are required:

- [SEP], to mark the end of a sentence, or the separation between two sentences.
- [CLS], at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is.
from transformers import BertTokenizer
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# An example sentence
text = "Sentence to embed"
# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"
# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)
# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
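
To go from the token ids to an actual fixed-size vector, you then run them through the pre-trained model itself. The snippet below is a minimal sketch (not part of the original code above), assuming a recent transformers version where the model returns an output object with a last_hidden_state attribute; one common choice of fixed-size sentence vector is the hidden state of the [CLS] token, which has 768 dimensions for bert-base-uncased.
import torch
from transformers import BertTokenizer, BertModel
# Tokenize as above and load the pre-trained model (weights).
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
marked_text = "[CLS] Sentence to embed [SEP]"
tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Wrap the token ids in a batch dimension and run them through the model.
tokens_tensor = torch.tensor([indexed_tokens])
with torch.no_grad():
    outputs = model(tokens_tensor)
# last_hidden_state has shape [batch_size, sequence_length, hidden_size],
# where hidden_size is 768 for bert-base-uncased.
last_hidden_state = outputs.last_hidden_state
# One fixed-size vector for the whole input: the [CLS] token's hidden state.
sentence_embedding = last_hidden_state[0, 0]
print(sentence_embedding.shape)  # torch.Size([768])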
Have a look at this excellent tutorial for more details.