I would like to download and load pre-trained word2vec vectors for analyzing Korean text.
I downloaded the pre-trained word2vec model from https://drive.google.com/file/d/0B0ZXk88koS2KbDhXdWg1Q2RydlU/view?resourcekey=0-Dq9yyzwZxAqT3J02qvnFwg which is linked from the GitHub repository "Pre-trained word vectors of 30+ languages": https://github.com/Kyubyong/wordvectors
My Gensim version is 4.1.0, so I used:
KeyedVectors.load_word2vec_format('./ko.bin', binary=False)
to load the model, but it raised this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I have already tried many of the suggestions on Stack Overflow and GitHub, but it still does not work. Could you point me to a suitable solution?
Thanks,
The page at https://github.com/Kyubyong/wordvectors isn't clear about the formats this author has chosen, but their source code at...
https://github.com/Kyubyong/wordvectors/blob/master/make_wordvectors.py#L61
...shows it using the Gensim model .save() method.
Such saved models should be reloaded using the .load() class method of the same model class. For example, if a Word2Vec model was saved with...
model.save('language.bin')
...then it could be reloaded with...
loaded_model = Word2Vec.load('language.bin')
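One quick way to check this diagnosis: Python pickles written with protocol 2 or later begin with the byte 0x80, which is the same byte the UnicodeDecodeError reports at position 0, and Gensim's .save() writes a pickle. The following is a minimal standard-library sketch; looks_like_pickle is a hypothetical helper name, not part of Gensim, and the two temporary files only stand in for real model files.

```python
import pickle
import tempfile
from pathlib import Path


def looks_like_pickle(path):
    """Return True if the file starts with the pickle protocol marker (0x80)."""
    with open(path, "rb") as fh:
        return fh.read(1) == b"\x80"


# Self-contained demo on two throwaway files (stand-ins for real model files).
with tempfile.TemporaryDirectory() as tmp:
    # A pickled object, similar in spirit to what a .save() call produces.
    pickled = Path(tmp) / "saved_model.bin"
    pickled.write_bytes(pickle.dumps({"vectors": [0.1, 0.2]}, protocol=4))

    # A tiny file in the word2vec text layout: header line, then one word per line.
    text_vectors = Path(tmp) / "vectors.txt"
    text_vectors.write_text("1 2\nword 0.1 0.2\n", encoding="utf-8")

    pickled_result = looks_like_pickle(pickled)
    text_result = looks_like_pickle(text_vectors)

print(pickled_result, text_result)
```

If looks_like_pickle('./ko.bin') returns True, the file is a candidate for Word2Vec.load('./ko.bin') rather than load_word2vec_format(), and the word vectors would then be available as the .wv attribute of the loaded model.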
Note, though, that a model saved by an older version of Gensim may not load in the current version, so you may need to install an older Gensim release to .load() the model. Then, you could save the plain vectors out with .save_word2vec_format() for later reloading across any version. (Or, using the latest interim version that can load the model, re-save the model with .save(), then repeat the process with the latest version that can read that model, until you reach the current Gensim.)
But you also might want to find a more recent and better-documented set of pretrained word vectors.
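The export route described above might look roughly like the sketch below. The function names and any file paths passed to them are hypothetical, and it assumes a Gensim release where the vectors live on the model's .wv attribute; in very old releases the export method may sit on the model object itself.

```python
def export_plain_vectors(saved_model_path, output_path):
    """Load a .save()-style Word2Vec model and export only its word vectors.

    Intended to run under a Gensim version that can still read the saved model.
    """
    # Imported here so the function can be defined even where Gensim is absent.
    from gensim.models import Word2Vec

    model = Word2Vec.load(saved_model_path)
    # Write the plain vectors in the text-based word2vec format.
    model.wv.save_word2vec_format(output_path, binary=False)
    return output_path


def load_plain_vectors(vectors_path):
    """Reload the exported vectors, for example under the current Gensim."""
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(vectors_path, binary=False)
```

For instance, export_plain_vectors('ko.bin', 'ko_vectors.txt') would run in the older environment and load_plain_vectors('ko_vectors.txt') in the current one. If finding a compatible older release proves too much trouble, a newer set of pretrained vectors may be the easier path.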
For example, Facebook makes FastText pretrained vectors available in both a 'text' format and a 'bin' format for many languages at https://fasttext.cc/docs/en/pretrained-vectors.html (trained on Wikipedia only) or https://fasttext.cc/docs/en/crawl-vectors.html (trained on Wikipedia plus web crawl data).
The 'text' format should in fact be loadable with KeyedVectors.load_word2vec_format(filename, binary=False), but will only include full-word vectors. (It will also be relatively easy to view as text, or to write simple code that massages it into other formats.)
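To illustrate how simple the 'text' format is: it starts with a header line giving the vocabulary size and vector dimension, followed by one line per word with its space-separated components. Here is a minimal standard-library sketch, where read_text_vectors is a hypothetical helper and the three-line sample stands in for a real vectors file:

```python
import io


def read_text_vectors(stream):
    """Parse word2vec-style text vectors into a {word: [floats]} dict."""
    # Header line: "<vocab_size> <dimension>"
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Tiny made-up sample: 2 words, 3 dimensions each.
sample = io.StringIO("2 3\n한국 0.1 0.2 0.3\n서울 0.4 0.5 0.6\n")
vocab_size, dim, vectors = read_text_vectors(sample)
print(vocab_size, dim, vectors["서울"])
```

For a real file, the same function could be passed open(filename, encoding='utf-8'), though for files with millions of words KeyedVectors.load_word2vec_format() will be far more memory-efficient than a dict of Python lists.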
The 'bin' format is Facebook's own native FastText model format, and should be loadable with either the load_facebook_model() or load_facebook_vectors() utility methods. Then, the loaded model (or vectors) will be able to create the FastText algorithm's substring-based guesstimate vectors even for many words that weren't in the model or training data.
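As a rough sketch of the 'bin' route, assuming Gensim 4.x and a FastText .bin file already downloaded from one of the pages above (the helper names and any path or word passed in are hypothetical):

```python
def load_fasttext_vectors(bin_path):
    """Load only the vectors from a Facebook-format FastText .bin file."""
    # Imported here so the function can be defined even where Gensim is absent.
    from gensim.models.fasttext import load_facebook_vectors

    return load_facebook_vectors(bin_path)


def lookup(bin_path, word):
    """Return (was_in_vocabulary, vector) for a word, known or not."""
    kv = load_fasttext_vectors(bin_path)
    # key_to_index holds only the full words seen in training.
    in_vocab = word in kv.key_to_index
    # Indexing can still yield a vector for an unseen word, synthesized from
    # the character n-grams it shares with the training data.
    return in_vocab, kv[word]
```

load_facebook_vectors() is the lighter option when only lookups are needed; load_facebook_model() returns the full model instead, whose vectors are then reached through its .wv attribute.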