Tags: python, nlp, pytorch, bert-language-model

Interpreting the output tokenization of BERT for a given word


from bert_embedding import BertEmbedding
bert_embedding = BertEmbedding(model='bert_12_768_12', dataset_name='wiki_multilingual_cased')
output = bert_embedding("any")

I need clarification on the output of mBERT embeddings. I'm aware that WordPiece tokenization is used to break up the input text. I also observed that when a single word (say "any") is provided as input, the output has length equal to the number of characters in the input (in our case, 3). output[i] is a tuple of lists where the first list contains the character at the ith position, with the 'unknown' token preceding and following it as separate elements of the array. This is followed by three (= length of the input word) arrays (embeddings) of size 768 each. Why does the output appear to be tokenized character-wise rather than WordPiece-tokenized?

I also found that the output format changes when the input is given as a list: bert_embedding(["any"]). The output is now a single tuple with ['[UNK]', 'state', '[UNK]'] as the first element, followed by three different embeddings, presumably corresponding to the three tokens listed above.
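
A minimal sketch of the list-input call and how I unpack the result, assuming (from my reading of the bert-embedding package, not documented here) that it returns one (tokens, embeddings) pair per input string:

from bert_embedding import BertEmbedding

bert_embedding = BertEmbedding(model='bert_12_768_12',
                               dataset_name='wiki_multilingual_cased')

# Input is a list of strings, as in the second call described above
results = bert_embedding(["any"])

tokens, embeddings = results[0]   # one (tokens, embeddings) pair per input string
print(tokens)                     # the token list observed above
print(len(embeddings), embeddings[0].shape)   # one 768-dimensional vector per token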

If I need the embedding of the last subword (not simply of the last character or the whole word) for a given input word, how do I access it?


Solution

  • I checked their GitHub page. About the input format: yes, it is expected to be a list (of strings). Also, this particular implementation provides token- (= word-) level embeddings, so subword-level embeddings can't be retrieved directly, although it does offer a choice of how a word's embedding is derived from its subword components (by taking the average, which is the default, taking the sum, or keeping just the last subword embedding). Refer to the Hugging Face interface for BERT for finer control over how the embeddings are taken, e.g. from which layers and with which operations. See the sketches below.
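
If you stay with the bert-embedding package, the choice between averaging, summing, or keeping the last subword appears to be exposed as a keyword argument on the call itself; the name oov_way below is my reading of the package source and should be treated as an assumption:

from bert_embedding import BertEmbedding

bert_embedding = BertEmbedding(model='bert_12_768_12',
                               dataset_name='wiki_multilingual_cased')

# 'last' (assumed keyword name: oov_way) keeps only the last subword's
# embedding for each word instead of averaging the subword embeddings
results = bert_embedding(["any"], oov_way='last')
tokens, embeddings = results[0]

For direct access to individual subword embeddings, here is a sketch using the Hugging Face transformers library; the checkpoint name 'bert-base-multilingual-cased' is my assumption for an mBERT model:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')
model.eval()

word = "any"
subwords = tokenizer.tokenize(word)             # WordPiece pieces of the word itself
inputs = tokenizer(word, return_tensors='pt')   # adds [CLS] ... [SEP]

with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state[0]           # shape: (number of tokens, 768)
# Layout is [CLS], subword_1, ..., subword_n, [SEP], so the last subword
# sits at index len(subwords) (index 0 is [CLS])
last_subword_embedding = hidden[len(subwords)]
print(subwords[-1], last_subword_embedding.shape)   # torch.Size([768])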