Tags: python, tensorflow, gensim, word2vec, tensorboard

How to visualize Gensim Word2vec Embeddings in Tensorboard Projector


Following the gensim word2vec embeddings tutorial, I have trained a simple word2vec model:

from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, size=100, window=5, min_count=1, workers=4)
model.save("/content/word2vec.model")

I would like to visualize it using the Embedding Projector in TensorBoard. There is another straightforward tutorial for this in the gensim documentation. I did the following in Colab:

!python3 -m gensim.scripts.word2vec2tensor -i /content/word2vec.model -o /content/my_model

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/gensim/scripts/word2vec2tensor.py", line 94, in <module>
    word2vec2tensor(args.input, args.output, args.binary)
  File "/usr/local/lib/python3.7/dist-packages/gensim/scripts/word2vec2tensor.py", line 68, in word2vec2tensor
    model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_model_path, binary=binary)
  File "/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py", line 1438, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/usr/local/lib/python3.7/dist-packages/gensim/models/utils_any2vec.py", line 172, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "/usr/local/lib/python3.7/dist-packages/gensim/utils.py", line 355, in any2unicode
    return unicode(text, encoding, errors=errors)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Please note that I did check this exact same question from 2018 first, but the accepted answer no longer works, since both gensim and tensorflow have been updated in the meantime, so I considered it worth asking again in Q4 2021.


Solution

  • Saving the model in the original C word2vec implementation format, i.e. with model.wv.save_word2vec_format("/content/word2vec.model"), resolves the issue:

    from gensim.test.utils import common_texts
    from gensim.models import Word2Vec
    model = Word2Vec(sentences=common_texts, size=100, window=5, min_count=1, workers=4)
    # save only the word vectors, in the plain word2vec text format that word2vec2tensor expects
    model.wv.save_word2vec_format("/content/word2vec.model")
    

    Gensim can store word2vec models in two formats: the keyed-vector format of the original word2vec implementation, and gensim's own format, which additionally stores the hidden weights, vocabulary frequencies, and more. Examples and details can be found in the documentation. The word2vec2tensor.py script expects the original format and loads the model with load_word2vec_format, which is why a model saved with model.save() fails with the UnicodeDecodeError shown above.
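
    With the model re-saved in that format, the conversion command from the question runs without errors and writes two files, <output>_tensor.tsv and <output>_metadata.tsv, which can then be uploaded to the hosted Embedding Projector at https://projector.tensorflow.org/:

    !python3 -m gensim.scripts.word2vec2tensor -i /content/word2vec.model -o /content/my_model
    # produces /content/my_model_tensor.tsv and /content/my_model_metadata.tsv

    If you prefer to view the embedding inside TensorBoard running in Colab instead of the hosted projector, the two TSV files can be turned into a checkpoint for the projector plugin. The snippet below is a minimal sketch following the TensorBoard projector-plugin tutorial; it assumes TF 2.x and the file paths used above, and the /content/logs directory is arbitrary:

    import os
    import numpy as np
    import tensorflow as tf
    from tensorboard.plugins import projector

    log_dir = "/content/logs"
    os.makedirs(log_dir, exist_ok=True)

    # vectors and labels written by word2vec2tensor above
    vectors = np.loadtxt("/content/my_model_tensor.tsv", delimiter="\t")
    with open("/content/my_model_metadata.tsv") as f:
        words = [line.strip() for line in f if line.strip()]

    # TensorBoard reads the labels from a metadata file inside the log dir
    with open(os.path.join(log_dir, "metadata.tsv"), "w") as f:
        f.write("\n".join(words) + "\n")

    # store the embedding matrix in a checkpoint the projector plugin can find
    weights = tf.Variable(vectors)
    checkpoint = tf.train.Checkpoint(embedding=weights)
    checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

    # point the projector at the checkpointed variable and the metadata file
    config = projector.ProjectorConfig()
    embedding = config.embeddings.add()
    embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
    embedding.metadata_path = "metadata.tsv"
    projector.visualize_embeddings(log_dir, config)

    Then run %load_ext tensorboard and %tensorboard --logdir /content/logs in a Colab cell and switch to the Projector tab.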