I want to use Word2Vec, and I have downloaded a Word2Vec file for the Indonesian language, but when I try to load it, I get an error. This is what I tried:
Model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Feature Extraction Lexicon Based/Word2Vec/idwiki_word2vec_100_new_lower.model.wv.vectors.npy', binary=True,)
and it gives me an error like this:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-73-219e152ee7d9> in <module>()
----> 1 Model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Feature Extraction Lexicon Based/Word2Vec/idwiki_word2vec_100_new_lower.model.wv.vectors.npy', binary=True,)
2 frames
/usr/local/lib/python3.7/dist-packages/gensim/utils.py in any2unicode(text, encoding, errors)
353 if isinstance(text, unicode):
354 return text
--> 355 return unicode(text, encoding, errors=errors)
356
357
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
A file named idwiki_word2vec_100_new_lower.model.wv.vectors.npy is unlikely to be in the format needed by load_word2vec_format(). The .npy extension suggests it is a raw numpy array, which is not the format expected.
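One way to check the raw-array theory is to look at the first bytes of the file. The NPY format begins with the magic bytes `\x93NUMPY`, which would also explain the `0x93 in position 0` in the traceback. The sketch below uses a small temporary file (`demo_vectors.npy`, a made-up name) as a stand-in for the downloaded file, so it is an illustration and not a test of the actual download:

```python
import os
import tempfile

import numpy as np

# Write a tiny array to a temporary .npy file (a stand-in for the real vectors file).
tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "demo_vectors.npy")
np.save(npy_path, np.zeros((3, 4), dtype=np.float32))

# Read the first six bytes: NPY files start with the magic string b"\x93NUMPY".
with open(npy_path, "rb") as f:
    magic = f.read(6)
print(magic)

# Byte 0x93 cannot start a UTF-8 sequence, so decoding it as text fails,
# which is what a text/word2vec-format reader would run into.
try:
    magic.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)
print(decode_error)

# numpy itself reads the file back as a plain array, with no vocabulary attached.
vectors = np.load(npy_path)
print(vectors.shape)
```

If the downloaded file starts with the same six bytes, it is a numpy array of vectors only; the words those vectors belong to are stored elsewhere.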
Also, the .wv.vectors. section suggests this could be part of a full, multi-file Gensim .save() of a complete Word2Vec model. That's more than just the vectors, & requires all associated files to re-load.
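If that guess is right, the thing to load would be the main saved file, not the .npy piece. The following is only a sketch under that assumption: it supposes the model was written with Gensim's `Word2Vec.save()` and that a main file whose name ends in `.model` sits in the same folder next to the `.npy` file. The helper names `base_model_path` and `load_full_model` are made up for illustration.

```python
SUFFIX = ".wv.vectors.npy"


def base_model_path(vectors_path):
    """Strip the '.wv.vectors.npy' suffix to guess the main saved-model path."""
    if vectors_path.endswith(SUFFIX):
        return vectors_path[: -len(SUFFIX)]
    return vectors_path


def load_full_model(vectors_path):
    """Load the full model, assuming it was saved with Word2Vec.save()."""
    # Imported here so the path helper above works without gensim installed.
    from gensim.models import Word2Vec

    # Word2Vec.load() is given the main file; it expects the side files that
    # .save() wrote (such as the .npy array) to sit in the same directory.
    return Word2Vec.load(base_model_path(vectors_path))


# Hypothetical usage (requires gensim and all of the saved files):
# model = load_full_model(
#     "/content/drive/MyDrive/Feature Extraction Lexicon Based/Word2Vec/"
#     "idwiki_word2vec_100_new_lower.model.wv.vectors.npy"
# )
# word_vectors = model.wv
```

If the main file is missing, or the files came from a different tool, this will fail too, which is another reason to check the original source.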
You should double-check the source of the vectors and what it claims about their format and the proper way to load them. (If you're still having problems & need more guidance, you should specify more details about the origin of the file – for example a link to the website where it was obtained – to support other suggestions.)