After training word embeddings, I saved them in npz format. When I try to load them as KeyedVectors, I get errors. How can I load a numpy array in the gensim KeyedVectors format? I need this because I want to use functions like most_similar(), not just the raw vector values.
In model.py, with TensorFlow:
self.verb_embeddings = tf.Variable(np.load(cfg.pretrained_target)["embeddings"],
                                   name="verb_embeddings",
                                   dtype=tf.float32,
                                   trainable=cfg.tune_emb)
In saving.py:
target_emb = sess.run(model.verb_embeddings)
np.savez_compressed("trained_target_emb.npz", embeddings=target_emb)
In main.py:
model = KeyedVectors.load('trained_target_emb.npz')
I got:
_pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified.
I also tried:
model = KeyedVectors.load_word2vec_format('trained_target_emb.npz')
but got:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xde in position 14: invalid continuation byte
Gensim KeyedVectors instances can't be loaded from a mere raw array: there's no information about which words are represented, and which indexes hold which words.
The plain .load() in gensim expects objects that were saved from gensim, using gensim's own .save() method.
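For illustration, if you still have the word list that your embedding rows correspond to, you could build a KeyedVectors object yourself and then round-trip it with gensim's own .save()/.load(). This is just a sketch, assuming gensim 4.x and a hypothetical target_vocab.txt that lists one word per line in the same order as the embedding rows:

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical vocabulary file: one word per line, same order as the
# rows of the saved embedding matrix.
vocab = [line.strip() for line in open("target_vocab.txt", encoding="utf-8")]
emb = np.load("trained_target_emb.npz")["embeddings"]

kv = KeyedVectors(vector_size=emb.shape[1])
kv.add_vectors(vocab, emb)          # pair each word with its embedding row (gensim 4.x API)
kv.save("trained_target_emb.kv")    # a gensim-native save ...

reloaded = KeyedVectors.load("trained_target_emb.kv")  # ... so plain .load() now works
print(reloaded.most_similar(vocab[0], topn=5))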
Word vectors can also be loaded from files in the same format used by the original Google/Mikolov word2vec.c tool. So perhaps your TensorFlow code can save them that way? Then you'd use .load_word2vec_format(), as sketched below.
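A rough sketch of that route, writing the plain-text word2vec format and using the same hypothetical target_vocab.txt word list as above:

import numpy as np
from gensim.models import KeyedVectors

emb = np.load("trained_target_emb.npz")["embeddings"]
vocab = [line.strip() for line in open("target_vocab.txt", encoding="utf-8")]

# Plain-text word2vec format: a "count dimensions" header line, then one
# line per word: the word followed by its vector components. Words must
# not contain whitespace, since the format is space-delimited.
with open("trained_target_emb.txt", "w", encoding="utf-8") as out:
    out.write(f"{len(vocab)} {emb.shape[1]}\n")
    for word, vec in zip(vocab, emb):
        out.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")

kv = KeyedVectors.load_word2vec_format("trained_target_emb.txt", binary=False)
print(kv.most_similar(vocab[0], topn=5))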