I have the predict_output_word method from the official GitHub repository, which only works for word2vec models trained with skip-gram and tries to predict the middle word by summing the vectors of all the input context words and dividing by the number of context words. It then propagates this to the output layer and takes a softmax (normalizing by the sum of the exponentiated scores) to get a probability for each candidate word, returning the most likely words. Is there a better way to approach this in order to get better words, since this gives very bad results for shorter sentences? Below is the code from GitHub.
def predict_output_word(model, context_words_list, topn=10):
    """Report the probability distribution of the center word given the context
    words as input to the trained model."""
    import warnings
    from numpy import exp, dot, sum as np_sum
    from gensim import matutils

    if not model.negative:
        raise RuntimeError("We have currently only implemented predict_output_word "
                           "for the negative sampling scheme, so you need to have "
                           "run word2vec with negative > 0 for this to work.")
    if not hasattr(model.wv, 'syn0') or not hasattr(model, 'syn1neg'):
        raise RuntimeError("Parameters required for predicting the output words not found.")

    # Keep only the context words that are in the model's vocabulary.
    word_vocabs = [model.wv.vocab[w] for w in context_words_list if w in model.wv.vocab]
    if not word_vocabs:
        warnings.warn("All the input context words are out-of-vocabulary for the current model.")
        return None
    word2_indices = [word.index for word in word_vocabs]

    # Sum the input vectors of all the context words ...
    l1 = np_sum(model.wv.syn0[word2_indices], axis=0)
    if word2_indices and model.cbow_mean:
        # ... and average them if the model was trained with cbow_mean=1.
        l1 /= len(word2_indices)

    # Propagate hidden -> output and take a softmax to get probabilities.
    prob_values = exp(dot(l1, model.syn1neg.T))
    prob_values /= np_sum(prob_values)
    top_indices = matutils.argsort(prob_values, topn=topn, reverse=True)

    # Return the most probable output words with their probabilities.
    return [(model.wv.index2word[index1], prob_values[index1]) for index1 in top_indices]
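For reference, this is roughly how such a prediction is invoked, assuming gensim 3.x (which still uses syn0 and index2word, matching the code above) and a trained model that exposes predict_output_word directly; the toy corpus and all training parameters below are made up purely for illustration:

from gensim.models import Word2Vec

# Toy corpus, purely for illustration.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# negative > 0 is required by predict_output_word; the other parameters are arbitrary guesses.
model = Word2Vec(sentences, size=50, window=2, min_count=1, sg=1, negative=5, iter=100)

# Predict the center word from its surrounding context words.
print(model.predict_output_word(["the", "cat", "on", "the", "mat"], topn=5))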
While the word2vec algorithm trains word-vectors by trying to predict words, and then those word-vectors may be useful for other purposes, it is not likely to be the ideal algorithm if word-prediction is your real goal.
Most word2vec implementations haven't even offered a specific interface for individual word-predictions. In gensim, predict_output_word() was only added recently. It only works for some modes (for example, it requires a model trained with negative sampling). It doesn't quite treat the window the same way as during training: there's no effective weighting-by-distance. And it is fairly expensive, essentially checking the model's prediction for every word in the vocabulary, then reporting the top-N. (The 'prediction' that occurs during training is 'sparse' and much more efficient: just running enough of the model to nudge it to be a bit better at a single example.)
If word-prediction is your real goal, you may get better results from other methods, including just calculating a big lookup-table of how often words appear near each other or near other n-grams. A rough sketch of that idea follows.
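A minimal sketch of such a co-occurrence lookup table, assuming sentences is a list of tokenized sentences and using an arbitrary symmetric window of 2 words (all function names and parameters here are illustrative, not any library's API):

from collections import defaultdict, Counter

def build_cooccurrence_table(sentences, window=2):
    """Count how often each word appears within `window` positions of each context word."""
    table = defaultdict(Counter)
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    table[tokens[j]][center] += 1
    return table

def predict_center_word(table, context_words, topn=10):
    """Score candidate center words by summing their co-occurrence counts with every context word."""
    scores = Counter()
    for w in context_words:
        scores.update(table.get(w, Counter()))
    return scores.most_common(topn)

# Example usage on a list of tokenized sentences:
# table = build_cooccurrence_table(sentences)
# print(predict_center_word(table, ["the", "cat", "on", "the", "mat"]))

Counting is cheap, works for rare contexts where a small word2vec model predicts poorly, and can be extended to count n-gram contexts rather than single words.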