
Can't load the pre-trained word2vec for Korean


I would like to download and load a pre-trained word2vec model for analyzing Korean text.

I downloaded the pre-trained word2vec model here: https://drive.google.com/file/d/0B0ZXk88koS2KbDhXdWg1Q2RydlU/view?resourcekey=0-Dq9yyzwZxAqT3J02qvnFwg from the GitHub repository Pre-trained word vectors of 30+ languages: https://github.com/Kyubyong/wordvectors

My gensim version is 4.1.0, so I used KeyedVectors.load_word2vec_format('./ko.bin', binary=False) to load the model. But I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

I have already tried many suggestions from Stack Overflow and GitHub, but it still doesn't work. Would you mind pointing me to a suitable solution?

Thanks,


Solution

  • While the page at https://github.com/Kyubyong/wordvectors isn't clear about the formats this author has chosen, their source code at...

    https://github.com/Kyubyong/wordvectors/blob/master/make_wordvectors.py#L61

    ...shows the models being saved with the Gensim model .save() method.

    Such saved models should be reloaded using the .load() class method of the same model class. For example, if a Word2Vec model was saved with...

    model.save('language.bin')
    

    ...then it could be reloaded with...

    loaded_model = Word2Vec.load('language.bin')
    
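    Applied to the file in the question, a minimal sketch might look like this (assuming the saved model loads cleanly under your Gensim version; see the notes below):

    from gensim.models import Word2Vec

    # The Kyubyong models were written with model.save(), so they must be
    # reloaded with the matching .load() class method - not with
    # KeyedVectors.load_word2vec_format(), which expects a different format.
    model = Word2Vec.load('./ko.bin')

    # In Gensim 4.x, the word-vectors live on the model's .wv attribute.
    print(model.wv.most_similar('한국'))  # '한국' ('Korea') is just an example query
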

    Note, though, that:

    • Models saved this way are often split over multiple files that should be kept together (and all start with the same root name) - but I don't see those here.
    • This work appears to be ~5 years old, based on a pre-1.0 version of Gensim – so there might be issues loading the models directly into the latest Gensim. If you do run into such issues, & absolutely need to make these vectors work, you might need to temporarily use a prior version of Gensim to .load() the model. Then, you could save the plain vectors out with .save_word2vec_format() for later reloading across any version, as sketched after this list. (Or, using the latest interim version that can load the model, re-save it with .save(), then repeat the process with the latest version that can read that model, until you reach the current Gensim.)
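
    That one-time conversion might look like the following sketch, run in a separate environment with an older Gensim installed (the exact version that can still read these files is a guess here, and may take some trial and error):

    # Run once under an older Gensim, e.g.: pip install gensim==3.8.3
    # (3.8.3 is an assumption; an even older release may be needed.)
    from gensim.models import Word2Vec

    old_model = Word2Vec.load('./ko.bin')

    # Export just the vectors to the portable word2vec text format, which
    # any Gensim version (and many other tools) can read back in.
    old_model.wv.save_word2vec_format('./ko.vec.txt', binary=False)

    Afterwards, in Gensim 4.1.0, KeyedVectors.load_word2vec_format('./ko.vec.txt', binary=False) should work as you originally expected.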

    But, you also might want to find a more recent & better-documented set of pretrained word-vectors.

    For example, Facebook makes FastText pretrained vectors available in both a 'text' format and a 'bin' format for many languages at https://fasttext.cc/docs/en/pretrained-vectors.html (trained on Wikipedia only) or https://fasttext.cc/docs/en/crawl-vectors.html (trained on Wikipedia plus web crawl data).

    The 'text' format should in fact be loadable with KeyedVectors.load_word2vec_format(filename, binary=False), but will only include full-word vectors. (It will also be relatively easy to view as text, or to write simple code to massage into other formats.)
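
    For example, a minimal sketch with the Korean crawl vectors (the filename cc.ko.300.vec below is the unzipped download, and is just illustrative):

    from gensim.models import KeyedVectors

    # Plain word2vec text format: one header line, then one word + its
    # vector per line. Loading may take a while and several GB of RAM.
    kv = KeyedVectors.load_word2vec_format('cc.ko.300.vec', binary=False)

    print(kv.most_similar('한국'))  # nearest neighbours of an example word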

    The 'bin' format is Facebook's own native FastText model format, and should be loadable with either the load_facebook_model() or load_facebook_vectors() utility methods. Then, the loaded model (or vectors) will be able to create the FastText algorithm's substring-based guesstimate vectors even for many words that weren't in the model or training data.
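
    A minimal sketch of that route (again using the illustrative crawl-vectors filename cc.ko.300.bin, unzipped):

    from gensim.models.fasttext import load_facebook_vectors

    # Facebook's native 'bin' format keeps the character-n-gram weights,
    # so vectors can be synthesized even for out-of-vocabulary strings.
    ft = load_facebook_vectors('cc.ko.300.bin')

    vec = ft['모르는단어']  # a made-up compound, likely absent from the vocabulary
    print(vec.shape)       # (300,) for a 300-dimensional model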