
'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte


I want to use Word2Vec and have downloaded a pretrained Indonesian-language Word2Vec model, but loading it raises an error. This is what I tried:

Model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Feature Extraction Lexicon Based/Word2Vec/idwiki_word2vec_100_new_lower.model.wv.vectors.npy', binary=True,)

It raises this error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-73-219e152ee7d9> in <module>()
----> 1 Model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Feature Extraction Lexicon Based/Word2Vec/idwiki_word2vec_100_new_lower.model.wv.vectors.npy', binary=True,)

2 frames
/usr/local/lib/python3.7/dist-packages/gensim/utils.py in any2unicode(text, encoding, errors)
    353     if isinstance(text, unicode):
    354         return text
--> 355     return unicode(text, encoding, errors=errors)
    356 
    357 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte

Solution

  • A file named idwiki_word2vec_100_new_lower.model.wv.vectors.npy is unlikely to be in the format needed by load_word2vec_format().

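    The .npy extension suggests it is a raw NumPy array, which is not the format load_word2vec_format() expects. The error message is consistent with this: every .npy file begins with the magic string \x93NUMPY, and 0x93 at position 0 is exactly the byte the UTF-8 decoder rejected.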

    Also, the .wv.vectors. part of the name suggests this could be one piece of a full, multi-file Gensim .save() of a complete Word2Vec model. Such a save holds more than just the vectors, and all of its associated files must be present to reload it.

    Double-check where the vectors came from and what the source says about their format and the proper way to load them. (If you still have problems and need more guidance, add more detail about the file's origin, for example a link to the website where it was obtained, so that others can make better suggestions.)
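A minimal sketch of how the points above could be checked, assuming the file really is one part of a Gensim Word2Vec .save(). The helper names (looks_like_npy, base_model_path, load_saved_model) are made up for illustration and are not part of Gensim. The guess that the main file is the same path without the .wv.vectors.npy suffix is also an assumption to verify against the download source.

```python
import os

# Every .npy file starts with these six bytes; 0x93 is the first one.
NPY_MAGIC = b"\x93NUMPY"
# Suffix Gensim typically uses when it stores the vector array in a separate file
# (assumption: the download came from a Gensim Word2Vec .save()).
VECTORS_SUFFIX = ".wv.vectors.npy"


def looks_like_npy(path):
    """Return True if the file begins with the NumPy .npy magic string."""
    with open(path, "rb") as f:
        return f.read(len(NPY_MAGIC)) == NPY_MAGIC


def base_model_path(path):
    """Guess the main file of a multi-file save by stripping the vectors suffix."""
    if path.endswith(VECTORS_SUFFIX):
        return path[: -len(VECTORS_SUFFIX)]
    return path


def load_saved_model(path):
    """Hypothetical helper: load the full model from the guessed main file."""
    from gensim.models import Word2Vec  # imported here so the checks above work without Gensim

    main_path = base_model_path(path)
    if not os.path.exists(main_path):
        raise FileNotFoundError(
            "Main model file not found: %s (all files of the save must sit together)" % main_path
        )
    return Word2Vec.load(main_path)
```

If looks_like_npy() returns True for the downloaded file, load_word2vec_format() is the wrong loader for it. If the main .model file sits in the same folder, load_saved_model(path).wv should then give the word vectors, provided the installed Gensim version can read the saved model.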