python-3.x nlp nltk wordnet

Does the NLTK WordNet interface in Python include any measure of semantic relatedness?


I know that I can compute semantic similarity through the NLTK interface using:

sim = wn.synset(name_1).path_similarity(wn.synset(name_2))

I also know that I can evaluate the semantic relatedness of words using vector space models and co-occurrence matrices, but I was not able to find any such measure in the NLTK interface.
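
For context, the co-occurrence approach mentioned above can be sketched in plain Python, independent of NLTK. This is only a minimal illustration under stated assumptions: the helper names `cooccurrence_vectors` and `cosine`, the window size, and the toy corpus are all invented here and are not part of any library.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented purely for illustration (hypothetical data).
toy_corpus = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the dog ate the bone".split(),
    "stocks fell on the market".split(),
]

vectors = cooccurrence_vectors(toy_corpus)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_market = cosine(vectors["cat"], vectors["market"])
print(cat_dog, cat_market)
```

On this toy corpus "cat" and "dog" share more context words than "cat" and "market", so the first score comes out higher. A real relatedness measure would need a much larger corpus and usually some reweighting of the raw counts.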


Solution

  • NLTK-WordNet has a host of word similarity algorithms based on the WordNet taxonomy, although none are based on vector space models or co-occurrence matrices.

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic
    
    # WordNet information content file (computed from the Brown corpus)
    brown_ic = wordnet_ic.ic('ic-brown.dat')
    
    # Use the first listed sense of each word
    cat = wn.synsets('cat')[0]
    dog = wn.synsets('dog')[0]
    
    
    '''
    Path Similarity:
    Return a score denoting how similar two word senses are,
    based on the shortest path that connects the senses
    in the is-a (hypernym/hyponym) taxonomy.
    The score is in the range 0 to 1.
    '''
    print(wn.path_similarity(cat, dog))
    # 0.2
    
    '''
    Leacock-Chodorow Similarity:
    Return a score denoting how similar two word senses are,
    based on the shortest path that connects the senses (as above)
    and the maximum depth of the taxonomy in which the senses occur.
    The relationship is given as -log(p / (2 * d)),
    where p is the shortest path length and d the taxonomy depth.
    '''
    print(wn.lch_similarity(cat, dog))
    # 2.0281482472922856
    
    '''
    Wu-Palmer Similarity:
    Return a score denoting how similar two word senses are,
    based on the depth of the two senses in the taxonomy
    and that of their Least Common Subsumer (most specific ancestor node).
    '''
    print(wn.wup_similarity(cat, dog))
    # 0.8571428571428571
    
    '''
    Lin Similarity:
    Return a score denoting how similar two word senses are,
    based on the Information Content (IC) of the Least Common Subsumer
    and that of the two input Synsets.
    The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
    '''
    print(wn.lin_similarity(cat, dog, ic=brown_ic))
    # 0.8768009843733973
    
    '''
    Resnik Similarity:
    Return a score denoting how similar two word senses are,
    based on the Information Content (IC) of the Least Common Subsumer.
    Note that for any similarity measure that uses information content,
    the result is dependent on the corpus used to generate the information content
    and the specifics of how the information content was created.
    '''
    print(wn.res_similarity(cat, dog, ic=brown_ic))
    # 7.911666509036577
    
    '''
    Jiang-Conrath Similarity:
    Return a score denoting how similar two word senses are,
    based on the Information Content (IC) of the Least Common Subsumer
    and that of the two input Synsets.
    The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
    '''
    print(wn.jcn_similarity(cat, dog, ic=brown_ic))
    # 0.4497755285516739
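
The three information-content measures above are tied together by the formulas quoted in the comments, so the printed scores can be cross-checked against each other. The sketch below is plain arithmetic on the numbers shown above and does not call NLTK; the variable names are made up for illustration. It recovers IC(s1) + IC(s2) from the Resnik and Lin scores and then rebuilds the Jiang-Conrath score.

```python
import math

# Scores printed by the example above (cat vs. dog, Brown information content).
res = 7.911666509036577   # Resnik: IC(lcs)
lin = 0.8768009843733973  # Lin: 2 * IC(lcs) / (IC(s1) + IC(s2))
jcn = 0.4497755285516739  # Jiang-Conrath: 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))

# Rearranging the Lin formula gives IC(s1) + IC(s2).
ic_sum = 2 * res / lin

# Plugging that sum into the Jiang-Conrath formula should reproduce jcn.
jcn_rebuilt = 1 / (ic_sum - 2 * res)

print(ic_sum)
print(jcn_rebuilt)
print(math.isclose(jcn_rebuilt, jcn, rel_tol=1e-6))
```

If the last line prints `True`, the three scores are mutually consistent, which also shows that Lin and Jiang-Conrath are two different ways of combining the same three IC values.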