Tags: python, nltk, wordnet

Poincare embeddings: building transitive closures from WordNet


I'd like to replicate Figure 2 in Poincaré Embeddings for Learning Hierarchical Representations, namely: Poincare embeddings from the "mammal" subtree of WordNet.

First, I construct the transitive closure needed to represent the graph. Following these docs and this SO answer, I do the following to construct the relations:

from nltk.corpus import wordnet as wn

root    = wn.synset('mammal.n.01')
# closure() takes a function that maps a synset to its related synsets
words   = list(set([w for s in root.closure(lambda s: s.hyponyms()) for w in s.lemma_names()]))
rname   = root.name().split('.')[0]
closure = [(word, rname) for word in words]

Then I am using Gensim's Poincare model to compute the embeddings. Given the example relations in Gensim's documentation, e.g.

relations = [('kangaroo', 'marsupial'), ('kangaroo', 'mammal'), ('gib', 'cat')]

I infer that the hypernym needs to be to the right. Here is the model fitting code:


from gensim.models.poincare import PoincareModel
from gensim.viz.poincare import poincare_2d_visualization

# Train on the (word, hypernym) pairs built above
model = PoincareModel(closure, size=2, negative=0)
model.train(epochs=50)

fig = poincare_2d_visualization(model, closure, 'WordNet Poincare embeddings')
fig.show()

However, the result is obviously not correct in that it looks nothing like the paper. What am I doing wrong?



Solution

  • I think the main issue here stems from this line:

    closure = [(word, rname) for word in words]
    

    You are generating a list in which every word is connected only to rname, which is "mammal". That is, you get only ("columbian_mammoth", "mammal") and miss the intermediate steps ("columbian_mammoth", "mammoth"), ("mammoth", "elephant"), ("elephant", "proboscidean") and so on.

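    I suggest a recursive function append_pairs to address this issue. I also adjusted the arguments to PoincareModel and poincare_2d_visualization: negative=10 instead of 0, 20 training epochs instead of 50, and num_nodes=None so that edges are plotted for all nodes.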

    from nltk.corpus import wordnet as wn
    from gensim.models.poincare import PoincareModel
    from gensim.viz.poincare import poincare_2d_visualization
    
    
    def simple_name(r):
        return r.name().split('.')[0]
    
    
    def append_pairs(my_root, pairs):
        # Walk the hyponyms depth-first, recording one
        # (hyponym, hypernym) pair per direct edge.
        for w in my_root.hyponyms():
            pairs.append((simple_name(w), simple_name(my_root)))
            append_pairs(w, pairs)
        return pairs
    
    
    if __name__ == '__main__':
        root = wn.synset('mammal.n.01')

        # Direct (hyponym, hypernym) edges of the whole "mammal" subtree
        relations = append_pairs(root, [])
    
        model = PoincareModel(relations, size=2, negative=10)
        model.train(epochs=20)
    
        fig = poincare_2d_visualization(model, relations, 'WordNet Poincare embeddings', num_nodes=None)
        fig.show()
    


    The image is not yet as beautiful as in the original source, but at least you can see the clustering now.
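
The relations built by append_pairs contain only the direct hypernym edges, while the question sets out to train on the transitive closure. As a minimal sketch of how the direct edges could be expanded into all (descendant, ancestor) pairs, the helper below uses plain Python on a toy edge list. The function name transitive_closure and the toy edges are illustrative assumptions, not part of NLTK or Gensim, and whether the extra pairs actually improve the plot is untested here.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Expand direct (child, parent) edges into all (descendant, ancestor) pairs."""
    parents = defaultdict(set)
    for child, parent in edges:
        if child != parent:  # skip self-loops, which collapsing synset names can create
            parents[child].add(parent)

    closure = set()
    for node in list(parents):
        stack = list(parents[node])
        seen = set()  # guards against revisiting ancestors (and against cycles)
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            if ancestor != node:
                closure.add((node, ancestor))
            stack.extend(parents.get(ancestor, ()))
    return sorted(closure)


# Hypothetical toy edges; with WordNet they would come from append_pairs(root, []).
edges = [
    ("columbian_mammoth", "mammoth"),
    ("mammoth", "elephant"),
    ("elephant", "proboscidean"),
    ("proboscidean", "mammal"),
]
pairs = transitive_closure(edges)
print(len(pairs))
```

The resulting pairs keep the hypernym on the right, so they could be passed to PoincareModel in place of relations. Using a set also removes duplicate pairs, which can arise when a synset is reachable through more than one hypernym.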