python, gensim, word2vec, word-embedding

Python word2vec: context similarity using surrounding words


I would like to use word2vec embeddings to obtain the most likely substitute words GIVEN a context (the surrounding words), rather than supplying an individual word.

Example: sentence = 'I would like to go to the park tomorrow after school'

If I want to find candidates similar to "park", typically I would just use the similarity method on the Gensim model's word vectors:

model.wv.most_similar('park')

and obtain semantically similar words. However, this could return words similar to the verb 'park' rather than the noun 'park', which is what I was after.

Is there any way to query the model and give it surrounding words as context to provide better candidates?


Solution

  • Word2vec is not, primarily, a word-prediction algorithm. Internally it makes rough predictions in order to train its word-vectors, but those training-time predictions usually aren't the end use for which the word-vectors are wanted.

    That said, recent versions of gensim added a predict_output_word() method that (for models trained with negative sampling) approximates the predictions done during training. It might be useful for your purposes; a sketch appears after this list.

    Alternatively, you could take the words most_similar() to your initial target word and re-rank them by how similar they also are to the context words; see the second sketch below.

    There have been some research papers on disambiguating multiple word senses (such as 'to park a car' versus 'a walk in the park') during word-vector training, but I haven't seen those techniques implemented in open-source libraries.
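
A minimal sketch of predict_output_word() on a toy corpus. The sentences, vector size, window, and epoch count are illustrative assumptions, not a recipe, and the method requires a model trained with negative sampling (gensim's default):

    from gensim.models import Word2Vec

    # Tiny illustrative corpus; real usage would train on much more text.
    sentences = [
        ['i', 'would', 'like', 'to', 'go', 'to', 'the', 'park', 'tomorrow', 'after', 'school'],
        ['we', 'walked', 'in', 'the', 'park', 'after', 'school'],
        ['please', 'park', 'the', 'car', 'near', 'the', 'school'],
    ]
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1,
                     negative=5, epochs=50, seed=1)

    # Supply the surrounding words (not 'park' itself) as the context:
    context = ['go', 'to', 'the', 'tomorrow', 'after']
    print(model.predict_output_word(context, topn=5))
    # -> a list of (word, probability) pairs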
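
The re-ranking idea from the second suggestion could look like the sketch below, reusing the toy model trained above. Scoring candidates by their plain average cosine similarity to the context words is my own assumption, not an established method:

    # Candidates similar to the target word alone.
    target = 'park'
    candidates = [w for w, _ in model.wv.most_similar(target, topn=10)]

    # Context words, keeping only those the model actually knows.
    context = ['go', 'to', 'the', 'tomorrow', 'after', 'school']
    context = [w for w in context if w in model.wv]

    def context_score(word):
        # Mean cosine similarity between a candidate and the context words.
        return sum(model.wv.similarity(word, c) for c in context) / len(context)

    # Re-rank so candidates compatible with the context come first.
    ranked = sorted(candidates, key=context_score, reverse=True)
    print(ranked[:5])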