Tags: python, nlp, pytorch, bert-language-model

Interpreting the output tokenization of BERT for a given word


from bert_embedding import BertEmbedding
bert_embedding = BertEmbedding(model='bert_12_768_12', dataset_name='wiki_multilingual_cased')
output = bert_embedding("any")

I need clarification on the output of mBERT embeddings. I'm aware that WordPiece tokenization is used to break up the input text. I also observed that when a single word (say "any") is provided as input, the output has length equal to the number of characters in the input (in our case, 3). output[i] is a tuple of lists where the first list contains the character at the ith position, with the 'unknown' token preceding and following it as separate elements of the array. This is followed by three (= length of the input word) arrays (embeddings) of size 768 each. Why does the output appear to be tokenized character-wise rather than WordPiece-tokenized?

I also found that the output format changes when the input is given as a list: bert_embedding(["any"]). The output is now a single tuple with ['[UNK]', 'state', '[UNK]'] as the first element, followed by three different embeddings, presumably corresponding to the three tokens listed above.
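
A minimal sketch of the list-input call and how I unpack the result, assuming (from my reading of the bert-embedding package, not documented here) that it returns one (tokens, embeddings) pair per input string:

from bert_embedding import BertEmbedding

bert_embedding = BertEmbedding(model='bert_12_768_12',
                               dataset_name='wiki_multilingual_cased')

# Input is a list of strings, as in the second call described above
results = bert_embedding(["any"])

tokens, embeddings = results[0]   # one (tokens, embeddings) pair per input string
print(tokens)                     # the token list observed above
print(len(embeddings), embeddings[0].shape)   # one 768-dimensional vector per token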

If I need the embedding of the last subword (not simply of the last character or the whole word) for a given input word, how do I access it?


Solution

  • I checked their GitHub page. About the input format: yes, it is expected to be a list (of strings). Also, this particular implementation provides token- (= word-) level embeddings, so subword-level embeddings can't be retrieved directly, although it does offer a choice of how a word's embedding is derived from its subword components (by taking the average, which is the default, taking the sum, or keeping just the last subword embedding). Refer to the Hugging Face interface for BERT for finer control over how the embeddings are taken, e.g. from which layers and with which operations. See the sketches below.
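
If you stay with the bert-embedding package, the choice between averaging, summing, or keeping the last subword appears to be exposed as a keyword argument on the call itself; the name oov_way below is my reading of the package source and should be treated as an assumption:

from bert_embedding import BertEmbedding

bert_embedding = BertEmbedding(model='bert_12_768_12',
                               dataset_name='wiki_multilingual_cased')

# 'last' (assumed keyword name: oov_way) keeps only the last subword's
# embedding for each word instead of averaging the subword embeddings
results = bert_embedding(["any"], oov_way='last')
tokens, embeddings = results[0]

For direct access to individual subword embeddings, here is a sketch using the Hugging Face transformers library; the checkpoint name 'bert-base-multilingual-cased' is my assumption for an mBERT model:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')
model.eval()

word = "any"
subwords = tokenizer.tokenize(word)             # WordPiece pieces of the word itself
inputs = tokenizer(word, return_tensors='pt')   # adds [CLS] ... [SEP]

with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state[0]           # shape: (number of tokens, 768)
# Layout is [CLS], subword_1, ..., subword_n, [SEP], so the last subword
# sits at index len(subwords) (index 0 is [CLS])
last_subword_embedding = hidden[len(subwords)]
print(subwords[-1], last_subword_embedding.shape)   # torch.Size([768])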