
word2vec - get nearest words


Given the output of the TensorFlow word2vec model, how can I list the words related to a specific word?

Reading the source (https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/tutorials/word2vec/word2vec_basic.py), I can see how the image is plotted.

But is there a data structure (e.g. a dictionary) created as part of training the model that gives access to the n words nearest to a given word? For example, if word2vec generated this image:


image src: https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html

In this image the words 'to', 'he', and 'it' are in the same cluster. Is there a function that takes 'to' as input and outputs 'he' and 'it' (in this case n=2)?


Solution

  • This approach applies to word2vec in general. If you can save the word vectors to a text or binary file, like the Google word2vec or GloVe vectors, then all you need is gensim.

    To install:

    Via GitHub

    Python code:

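    from gensim.models import KeyedVectors

    # fname: path to vectors saved in word2vec text format
    # (pass binary=True for the binary format).
    # Older gensim releases exposed this loader as Word2Vec.load_word2vec_format.
    gmodel = KeyedVectors.load_word2vec_format(fname)
    # topn must be a keyword argument; the second positional
    # parameter of most_similar is `negative`, not the result count.
    ms = gmodel.most_similar('good', topn=10)
    for word, similarity in ms:
        print(word, similarity)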
    

    However, this compares the query against every word in the vocabulary. Approximate nearest neighbor (ANN) methods return results faster, at some cost in accuracy.

    Recent versions of gensim can use Annoy to perform the ANN search; see this notebook for more information.

    FLANN is another library for approximate nearest neighbors.
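
If you would rather not depend on gensim, the exhaustive search described above can be sketched directly. This is a minimal illustration, not how the TensorFlow tutorial or gensim implements it: it ranks every other word by cosine similarity to the query. The helper name `nearest_words` and the toy vectors are made up for the example. In word2vec_basic.py the corresponding objects appear to be `dictionary` (word to index) and `final_embeddings` (one row per word), so treat that mapping as an assumption to check against your version of the script.

```python
import math


def nearest_words(word, dictionary, embeddings, n=2):
    """Return the n (word, cosine similarity) pairs closest to `word`.

    dictionary: maps word -> row index (hypothetical stand-in for the
                tutorial's `dictionary`).
    embeddings: sequence of vectors, one per word index.
    """
    reverse_dictionary = {idx: w for w, idx in dictionary.items()}
    query_idx = dictionary[word]
    query = embeddings[query_idx]
    query_norm = math.sqrt(sum(x * x for x in query))

    scored = []
    for idx, vec in enumerate(embeddings):
        if idx == query_idx:
            continue  # skip the query word itself
        norm = math.sqrt(sum(x * x for x in vec))
        cosine = sum(a * b for a, b in zip(query, vec)) / (query_norm * norm)
        scored.append((reverse_dictionary[idx], cosine))

    # Highest similarity first; this scans the whole vocabulary (no ANN).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy data, invented purely for illustration.
dictionary = {"to": 0, "he": 1, "it": 2, "nine": 3}
embeddings = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.8, 0.0],
    [-0.1, 1.0],
]

print(nearest_words("to", dictionary, embeddings, n=2))
```

With real embeddings stored as a NumPy array, the loop would normally be replaced by a single matrix-vector product, but the ranking logic is the same.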