I was trying to use Gensim to load the GoogleNews pretrained model and look up vectors for some English words (15 sample words for now, stored in a txt file with one word per line; there is no additional corpus context). Then I wanted to call model.most_similar() to get similar words/phrases for each of them. But the file I saved and loaded with Python's pickle can't be used directly with gensim's built-in model.load() and model.most_similar() functions.
How should I go about clustering the 15 English words (and more in the future), given that I can't train, save, and load a model from scratch?
import gensim
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors

GOOGLE_WORD2VEC_MODEL = '../GoogleNews-vectors-negative300.bin'
GOOGLE_ENGLISH_WORD_PATH = '../testwords.txt'
GOOGLE_WORD_FEATURE = '../word.google.vector'

model = gensim.models.KeyedVectors.load_word2vec_format(GOOGLE_WORD2VEC_MODEL, binary=True)

# load 15 words as a test into word_vectors
word_vectors = {}
with open(GOOGLE_ENGLISH_WORD_PATH) as f:
    for line in f.readlines():
        word = line.strip('\n')
        if word:
            print(word)
            word_vectors[word] = None

try:
    import cPickle
except ImportError:
    import _pickle as cPickle

def save_model(clf, modelpath):
    with open(modelpath, 'wb') as f:
        cPickle.dump(clf, f)

def load_model(modelpath):
    try:
        with open(modelpath, 'rb') as f:
            rf = cPickle.load(f)
        return rf
    except Exception as e:
        return None

for word in word_vectors:
    try:
        word_vectors[word] = model[word]
    except KeyError:
        pass

save_model(word_vectors, GOOGLE_WORD_FEATURE)
words_set = load_model(GOOGLE_WORD_FEATURE)
words_set.most_similar("knit", topn=3)
---------------error message--------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-86c15e366696> in <module>
----> 1 words_set.most_similar("knit", topn=3)

AttributeError: 'dict' object has no attribute 'most_similar'
---------------error message--------
You've defined word_vectors as a Python dict:

word_vectors = {}

Then your save_model() function just saves that raw dict, and your load_model() loads that same raw dict.
Such dictionary objects don't implement the most_similar() method, which is specific to the KeyedVectors interface (& related classes) of gensim.
So, you'll have to leave the data inside a KeyedVectors-like object to be able to use most_similar().
Fortunately, you have a few options.
If you happen to need just the first 15 words from inside the GoogleNews file (or the first 15,000, etc.), you could use the optional limit parameter to read only that many vectors:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(GOOGLE_WORD2VEC_MODEL, limit=15, binary=True)
Alternatively, if you really need to select an arbitrary subset of the words, and assemble them into a new KeyedVectors instance, you could re-use one of the classes inside gensim instead of a plain dict, then add your vectors in a slightly different way:
# instead of a {} dict
word_vectors = KeyedVectors(model.vector_size) # re-use size from loaded model
...then later, inside your loop over each word you want to add...
# instead of `word_vectors[word] = _SOMETHING_`
word_vectors.add(word, model[word])
Then you'll have a word_vectors that is an actual KeyedVectors object. While you could save that via plain Python pickle, at that point you might as well use the KeyedVectors built-in save() and load() - they may be more efficient on large vector sets (by saving large sets of raw vectors as a separate file, which should be kept alongside the main file). For example:
word_vectors.save(GOOGLE_WORD_FEATURE)
...
words_set = KeyedVectors.load(GOOGLE_WORD_FEATURE)
words_set.most_similar("knit", topn=3) # should work
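Finally, on the clustering goal you mentioned: once your chosen words' vectors are stacked into a NumPy array, any standard clustering routine can run on them (sklearn's KMeans would be the usual choice). Here's a minimal hand-rolled k-means sketch on made-up stand-in vectors, just to show the shape of the approach - the vectors and the two "groups" are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-ins for word vectors: two well-separated groups (made-up data)
vectors = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 3)),  # e.g. craft-related words
    rng.normal(loc=5.0, scale=0.1, size=(5, 3)),  # e.g. sport-related words
])

def kmeans(X, k, iters=20):
    """Bare-bones k-means: alternate nearest-centroid assignment and mean update."""
    # simple deterministic init: k points spread evenly through the data
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(vectors, k=2)
print(labels)  # the first 5 points share one label, the last 5 the other
```

With real data you'd replace `vectors` with `np.vstack([words_set[w] for w in words_set.index_to_key])` (or the equivalent attribute for your gensim version) and pick k to taste.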