Tags: nlp, tokenize, bert-language-model, huggingface-transformers

Getting word-level encodings from sub-word token encodings


I'm looking into using a pretrained BERT model ('bert-base-uncased') to extract contextualised word-level encodings from a bunch of sentences.

WordPiece tokenisation breaks some of the words in my input down into subword units. Possibly a trivial question, but I was wondering what the most sensible way would be to combine the output encodings for subword tokens into word-level encodings.

Is averaging subword encodings a reasonable way to go? If not, is there any better alternative?


Solution

  • Intuitively, your problem is similar to "how to get a good sentence representation", except that for sentences most transformer-based models these days let you use the classification token as the sentence representation. No such token is available for token-level representations, though.

    In your case, there are a few options, but from what I've seen, people most often use either an average or a max value: take the element-wise average of the subword vectors, or take their element-wise maximum. Averaging is the most intuitive place to start, in my opinion (see the sketch below this answer).

    Note that an average is only just that: an average over a sequence. This means it is not especially precise (one high and one low value have the same mean as two medium values), but it is probably the most straightforward option.
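In case a concrete starting point helps, here is a minimal sketch of mean-pooling subword vectors into word vectors with the Hugging Face transformers library. It assumes a fast tokenizer (the default for 'bert-base-uncased'), whose `word_ids()` method maps each subword token back to the word it came from; the example sentence and variable names are purely illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "Contextualised embeddings are unbelievably useful."
encoded = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    # last_hidden_state has shape (1, num_subword_tokens, 768)
    token_embeddings = model(**encoded).last_hidden_state[0]

# word_ids() maps each token position to its source word index
# (None for special tokens such as [CLS] and [SEP])
word_ids = encoded.word_ids()

word_embeddings = []
for word_idx in sorted(set(i for i in word_ids if i is not None)):
    # indices of all subword tokens belonging to this word
    token_idxs = [t for t, w in enumerate(word_ids) if w == word_idx]
    # mean-pool the subword vectors; for max-pooling you could use
    # token_embeddings[token_idxs].max(dim=0).values instead
    word_embeddings.append(token_embeddings[token_idxs].mean(dim=0))

word_embeddings = torch.stack(word_embeddings)  # (num_words, 768)
print(word_embeddings.shape)
```

Swapping the `mean` for a `max` in the loop gives you the max-pooling variant mentioned above, so it's easy to compare both on your own data.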