Gensim built-in model.load function and Python Pickle.load file


I'm trying to use Gensim to load the GoogleNews pre-trained model and look up some English words (a sample of 15 for now, stored in a text file with one word per line; there is no further context to use as a corpus). I then want to call "model.most_similar()" to get similar words/phrases for each of them. However, the file I load back with Python's pickle can't be used directly with gensim's built-in model.load() and model.most_similar() functions.

How should I cluster the 15 English words (and more in the future), given that I can't train, save, and load a model from scratch?

import gensim
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors

GOOGLE_WORD2VEC_MODEL = '../GoogleNews-vectors-negative300.bin'

GOOGLE_ENGLISH_WORD_PATH = '../testwords.txt'

GOOGLE_WORD_FEATURE = '../word.google.vector'

model = gensim.models.KeyedVectors.load_word2vec_format(GOOGLE_WORD2VEC_MODEL, binary=True) 

word_vectors = {}

#load 15 words as a test to word_vectors

with open(GOOGLE_ENGLISH_WORD_PATH) as f:
    lines = f.readlines()
    for line in lines:
        line = line.strip('\n')
        if line:                
            word = line
            print(line)
            word_vectors[word]=None
try:
    import cPickle  # Python 2
except ImportError:
    import _pickle as cPickle  # Python 3

def save_model(clf,modelpath): 
    with open(modelpath, 'wb') as f: 
        cPickle.dump(clf, f) 

def load_model(modelpath): 
    try: 
        with open(modelpath, 'rb') as f: 
            rf = cPickle.load(f) 
            return rf 
    except Exception as e:        
        return None 

for word in word_vectors:
    try:
        v= model[word]
        word_vectors[word] = v
    except KeyError:  # word is not in the pretrained vocabulary
        pass

save_model(word_vectors,GOOGLE_WORD_FEATURE)

words_set = load_model(GOOGLE_WORD_FEATURE)

words_set.most_similar("knit", topn=3)
---------------error message--------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-86c15e366696> in <module>
----> 1 words_set.most_similar("knit", topn=3)

AttributeError: 'dict' object has no attribute 'most_similar'
---------------error message--------

Solution

  • You've defined word_vectors as a Python dict:

    word_vectors = {}
    

    Then your save_model() function just saves that raw dict, and your load_model() loads that same raw dict.

    Such dict objects don't implement the most_similar() method, which is specific to gensim's KeyedVectors interface (and related classes).

    So, to be able to use most_similar(), you'll have to keep the data inside a KeyedVectors-like object.

    Fortunately, you have a few options.

    If you happen to need just the first 15 words in the GoogleNews file (or the first 15,000, etc.), you can use the optional limit parameter to read only that many vectors:

    from gensim.models import KeyedVectors
    model = KeyedVectors.load_word2vec_format(GOOGLE_WORD2VEC_MODEL, limit=15, binary=True)
    

    Alternatively, if you really need to select an arbitrary subset of the words, and assemble them into a new KeyedVectors instance, you could re-use one of the classes inside gensim instead of a plain dict, then add your vectors in a slightly different way:

    # instead of a {} dict
    word_vectors = KeyedVectors(model.vector_size)  # re-use size from loaded model
    

    ...then later inside your loop of each word you want to add...

    # instead of `word_vectors[word] = _SOMETHING_`
    # (gensim 3.x API; in gensim 4.x the equivalent is add_vector(word, model[word]))
    word_vectors.add(word, model[word])
    

    Then you'll have a word_vectors that is an actual KeyedVectors object. While you could save that via plain Python-pickle, at that point you might as well use the KeyedVectors built-in save() and load() - they may be more efficient on large vector sets (by saving large sets of raw vectors as a separate file which should be kept alongside the main file). For example:

    word_vectors.save(GOOGLE_WORD_FEATURE)
    

    ...

    words_set = KeyedVectors.load(GOOGLE_WORD_FEATURE)
    
    words_set.most_similar("knit", topn=3)  # should work
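
If the vectors do have to stay in a plain dict for some reason, the ranking can also be computed by hand. The following is only a minimal sketch, not gensim's implementation. It assumes ranking by cosine similarity (the measure most_similar() is documented to use) and uses invented 3-dimensional toy vectors in place of the 300-dimensional GoogleNews ones. The helper name most_similar_in_dict is hypothetical. It also shows that a pickled dict comes back as a plain dict, which is why the original call failed.

```python
import math
import pickle


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_in_dict(vectors, word, topn=3):
    """Hypothetical helper: rank the other keys of `vectors` by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word and vec is not None  # skip words that had no vector
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "knit": [1.0, 0.0, 0.0],
    "sew": [0.9, 0.1, 0.0],
    "weave": [0.7, 0.7, 0.0],
    "drive": [0.0, 0.0, 1.0],
    "missing": None,  # mirrors a word that was not found in the model
}

# Pickle round trip: a dict goes in and a dict comes out.
restored = pickle.loads(pickle.dumps(toy_vectors))
print(type(restored).__name__)
print(hasattr(restored, "most_similar"))
print(most_similar_in_dict(restored, "knit", topn=2))
```

Either way, whether you use a dict or a reduced KeyedVectors, only the words you kept are ranked, so the neighbours come from your 15 words rather than from the full GoogleNews vocabulary.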