Tags: tokenize, encoder, tensorflow-hub

Textual representation of LaBSE preprocessor output?


I use the following model to tokenize sentences from multiple languages: https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2

Which, for the following input:

"I wish you a pleasant flight and a good meal aboard this plane."

outputs the following tokens:

[101, 146, 34450, 15100, 170, 147508, 48088, 14999, 170, 17072, 66369, 351617, 15272, 69746, 119, 102]

From this output, I would like to recover a textual representation of the tokens. Something like:

[START, I, wish, ..., plane, .]

So far I've been looking for the token<=>text mapping, but the resources I found are mostly about BERT, which has several MONO-lingual models, whereas I want to stay language-agnostic.

Any clue about how to do that?

Thanks in advance for your help,


Solution

  • The default cache location for the google/universal-sentence-encoder-cmlm/multilingual-preprocess/2 model is /tmp/tfhub_modules/8e75887695ac632ead11c556d4a6d45194718ffb (more on caching). In the assets directory, you'll find cased_vocab.txt, which is the vocabulary used by the preprocessor:

    !cat /tmp/tfhub_modules/.../assets/cased_vocab.txt | sed -n 102p
    > [CLS]
    !cat /tmp/tfhub_modules/.../assets/cased_vocab.txt | sed -n 147p
    > I
    !cat /tmp/tfhub_modules/.../assets/cased_vocab.txt | sed -n 34451p
    > wish
    ...
    

    Note that sed uses 1-based line numbering while the token IDs emitted by the preprocessor are 0-based, hence the off-by-one in the line numbers above.
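    Once the vocabulary file is located, mapping IDs back to tokens is straightforward in Python: line i (0-based) of the file is the token with ID i, so a plain list indexed by token ID already absorbs the off-by-one that sed requires. A minimal sketch, using a tiny made-up vocabulary written to a temp file in place of the real cased_vocab.txt (whose path and contents will differ on your machine):

    ```python
    import tempfile

    def load_vocab(path):
        # One token per line; line i (0-based) holds the token whose ID is i.
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    def ids_to_tokens(ids, vocab):
        # Direct list indexing: token ID == 0-based line number.
        return [vocab[i] for i in ids]

    # --- demo with a hypothetical miniature vocabulary ---
    demo_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "I", "wish"]
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(demo_tokens) + "\n")
        vocab_path = f.name

    vocab = load_vocab(vocab_path)
    print(ids_to_tokens([2, 4, 5, 3], vocab))  # ['[CLS]', 'I', 'wish', '[SEP]']
    ```

    Pointing load_vocab at the real cased_vocab.txt under the TF Hub cache should then turn the question's ID sequence back into [CLS], I, wish, ..., plane, ., [SEP].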