python numpy tensorflow gensim embedding

How to load a NumPy array into gensim KeyedVectors format?


After training word embeddings, I saved them in npz format. When I try to load them as KeyedVectors, I get errors. How can I load a NumPy array as gensim.KeyedVectors? I need this because I want to use functions like most_similar(), not just the raw vector values.

In model.py (TensorFlow):

self.verb_embeddings = tf.Variable(np.load(cfg.pretrained_target)["embeddings"],
                                   name="verb_embeddings",
                                   dtype=tf.float32,
                                   trainable=cfg.tune_emb)

In saving.py:

target_emb = sess.run(model.verb_embeddings)
np.savez_compressed("trained_target_emb.npz", embeddings=target_emb)

In main.py:

model = KeyedVectors.load('trained_target_emb.npz')

I got:

_pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified.

I also tried:

model = KeyedVectors.load_word2vec_format('trained_target_emb.npz')

but got:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xde in position 14: invalid continuation byte

Solution

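  • A gensim KeyedVectors instance can't be built from a raw array alone: the array carries no information about which words are represented or which row holds which word.

    The plain .load() in gensim expects a file that gensim itself wrote with its own .save() method, so it cannot read an .npz archive written by NumPy.

    Word vectors can also be loaded from files in the format used by the original Google/Mikolov word2vec.c tool, and that is the format .load_word2vec_format() expects. A compressed .npz archive is not in that format, which is why the call fails while decoding. If your TensorFlow code (or a small conversion step after it) writes the vectors in that format, together with the word for each row, you can then load them with .load_word2vec_format().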
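
    A minimal sketch of that conversion, assuming you still have the vocabulary as a list of words in the same row order as the embedding matrix. The helper name save_word2vec_text and the words list are hypothetical, not part of gensim or of the code in the question. The word2vec text format is a header line with the vocabulary size and vector dimensionality, followed by one line per word: the word, then its values separated by spaces.

```python
def save_word2vec_text(path, words, vectors):
    """Write words and their vectors in the word2vec.c text format.

    `vectors` can be a 2-D NumPy array or any sequence of numeric rows.
    Row i of `vectors` must be the embedding of words[i]; the words
    themselves must not contain whitespace.
    """
    rows = [[float(x) for x in row] for row in vectors]
    if len(words) != len(rows):
        raise ValueError("need exactly one word per embedding row")
    dim = len(rows[0]) if rows else 0
    with open(path, "w", encoding="utf-8") as f:
        # Header line: vocabulary size and vector dimensionality.
        f.write(f"{len(rows)} {dim}\n")
        for word, row in zip(words, rows):
            if len(row) != dim:
                raise ValueError("all embedding rows must have the same length")
            f.write(word + " " + " ".join(repr(x) for x in row) + "\n")


# Hypothetical usage with the file from the question:
#   vectors = np.load("trained_target_emb.npz")["embeddings"]
#   words = [...]  # assumption: your vocabulary, in training row order
#   save_word2vec_text("trained_target_emb.txt", words, vectors)
#   model = KeyedVectors.load_word2vec_format("trained_target_emb.txt", binary=False)
```

    The word list has to come from wherever your training pipeline built its vocabulary; it can't be recovered from the .npz file itself.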