I am new to NLP and I am running into this issue that I do not understand at all:
I have a text file with GloVe vectors. I converted it to the word2vec format using
glove2word2vec(TXT_FILE_PATH, KV_FILE_PATH)
This creates a KV file in my path, which can then be loaded using
word_vectors = KeyedVectors.load_word2vec_format(KV_FILE_PATH, binary=False)
I then save it using
word_vectors.save(KV_FILE_PATH)
But when I now try to use the new KV file in intersect_word2vec_format, it gives me an encoding error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-11-d975bb14af37> in <module>
6
7 print("Intersect with pre-trained model...")
----> 8 model.intersect_word2vec_format(KV_FILE_PATH, binary=False)
9
10 print("Train custom word2vec model...")
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/gensim/models/word2vec.py in intersect_word2vec_format(self, fname, lockf, binary, encoding, unicode_errors)
890 logger.info("loading projection weights from %s", fname)
891 with utils.open(fname, 'rb') as fin:
--> 892 header = utils.to_unicode(fin.readline(), encoding=encoding)
893 vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
894 if not vector_size == self.wv.vector_size:
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/gensim/utils.py in any2unicode(text, encoding, errors)
366 if isinstance(text, unicode):
367 return text
--> 368 return unicode(text, encoding, errors=errors)
369
370
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The .save() method saves a model in Gensim's native format, which is primarily Python pickling, with large arrays stored as separate files (which must be kept alongside the main save file).
That format is not the same as the word2vec_format that can be loaded by load_word2vec_format() or intersect_word2vec_format().
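To see why the error mentions byte 0x80 at position 0, here is a minimal standard-library sketch. It assumes, as described above, that the native save file is essentially a pickle; the dictionary and the one-word vector file are made-up stand-ins, not real Gensim output. Pickles written with protocol 2 or higher start with the byte 0x80, which is not a valid first byte in UTF-8, while a word2vec text file starts with a plain "vocab_size vector_size" header line.

```python
import pickle

# Hypothetical stand-in for the bytes a pickle-based save would write.
native_bytes = pickle.dumps({"cat": [0.1, 0.2, 0.3]}, protocol=pickle.HIGHEST_PROTOCOL)
first_byte = native_bytes[:1]  # protocol >= 2 pickles begin with the PROTO opcode, 0x80

# Decoding those bytes as UTF-8 fails in the same way as in the traceback.
try:
    native_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)

# A word2vec text file instead begins with a readable header line.
w2v_bytes = "1 3\ncat 0.1 0.2 0.3\n".encode("utf-8")
header = w2v_bytes.split(b"\n")[0].decode("utf-8")
vocab_size, vector_size = (int(x) for x in header.split())

print(first_byte, decode_error, vocab_size, vector_size)
```

So the header parsing in intersect_word2vec_format never gets as far as reading the two integers: the first line of the file is not text at all.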
If you want to save a set of vectors into the word2vec_format, use the method .save_word2vec_format(), not plain .save().