Tags: keras, word-embedding

How to find similar words in Keras Word Embedding layer


From Stanford's CS224N course, I know that Gensim provides a fantastic method to play around with the embedding data: most_similar
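
For reference, this is roughly the behaviour I am after (a sketch assuming a gensim Word2Vec model trained on some tokenised sentences):

    from gensim.models import Word2Vec

    # sentences: an iterable of tokenised sentences (hypothetical corpus)
    model = Word2Vec(sentences, min_count=1)

    # Top-10 most similar words by cosine similarity, including an analogy-style query
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=10))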

I was trying to find an equivalent in the Keras Embedding layer, but I couldn't. Isn't this possible out of the box with Keras? Or is there a wrapper on top of it?


Solution

  • A simple implementation would be:

    import tensorflow as tf


    def most_similar(emb_layer, pos_word_idxs, neg_word_idxs=[], top_n=10):
        # The embedding matrix is the layer's first (and only) weight tensor
        weights = emb_layer.weights[0]

        # Query vector: mean of the positive word vectors minus the negative ones
        # (same idea as gensim's most_similar)
        mean = []
        for idx in pos_word_idxs:
            mean.append(weights.value()[idx, :])

        for idx in neg_word_idxs:
            mean.append(weights.value()[idx, :] * -1)

        mean = tf.reduce_mean(mean, 0)

        # Dot-product similarity between every embedding and the query vector
        dists = tf.tensordot(weights, mean, 1)
        best = tf.math.top_k(dists, top_n)

        # Mask out the words that were used as pos or neg inputs
        mask = []
        for v in set(pos_word_idxs + neg_word_idxs):
            mask.append(tf.cast(tf.equal(best.indices, v), tf.int8))
        mask = tf.less(tf.reduce_sum(mask, 0), 1)

        return tf.boolean_mask(best.indices, mask), tf.boolean_mask(best.values, mask)
    

    Of course you need to know the indices of the words. I assume you have a word2idx mapping, so you can get them like this: [word2idx[w] for w in pos_words].
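
    Going the other way, the indices the function returns can be mapped back to words with an inverted dictionary (idx2word is a hypothetical name used here for illustration):

    idx2word = {idx: word for word, idx in word2idx.items()}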

    To use it:

    # Assuming the first layer is the Embedding layer and you are interested in the word with idx 10
    idxs, vals = most_similar(model.layers[0], [10])

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # Fetch both tensors in a single run so the graph is evaluated only once
        idxs, vals = sess.run([idxs, vals])
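
    After the session has run, idxs and vals are plain NumPy arrays, so turning the result into (word, similarity) pairs is straightforward (using the hypothetical idx2word mapping from above):

    similar = [(idx2word[i], float(v)) for i, v in zip(idxs, vals)]

    On TensorFlow 2.x, where eager execution is the default, no session should be needed: the same call returns eager tensors, and idxs.numpy() / vals.numpy() give their values.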
    

    Some potential improvements for that function:

    • Make sure it always returns top_n words (after the mask it can return fewer)
    • gensim uses L2-normalised embeddings, so the dot product becomes a cosine similarity
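
    Both improvements could be sketched roughly like this (untested; it L2-normalises the embedding matrix and the query vector so the dot product is a cosine similarity, and requests extra candidates so that top_n results survive the masking):

    import tensorflow as tf


    def most_similar_normed(emb_layer, pos_word_idxs, neg_word_idxs=(), top_n=10):
        # L2-normalise the embedding matrix so dot products are cosine similarities
        weights = tf.math.l2_normalize(emb_layer.weights[0], axis=1)

        # Query vector: mean of positive vectors minus negative ones, re-normalised
        query = [weights[idx, :] for idx in pos_word_idxs]
        query += [weights[idx, :] * -1 for idx in neg_word_idxs]
        query = tf.math.l2_normalize(tf.reduce_mean(query, 0), axis=0)

        dists = tf.tensordot(weights, query, 1)

        # Request extra candidates so that top_n results remain after masking
        used = set(pos_word_idxs) | set(neg_word_idxs)
        best = tf.math.top_k(dists, top_n + len(used))

        # Mask out the words that were used as pos or neg inputs
        mask = []
        for v in used:
            mask.append(tf.cast(tf.equal(best.indices, v), tf.int8))
        mask = tf.less(tf.reduce_sum(mask, 0), 1)

        idxs = tf.boolean_mask(best.indices, mask)[:top_n]
        vals = tf.boolean_mask(best.values, mask)[:top_n]
        return idxs, vals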