python nlp nltk similarity wordnet

How to calculate shortest paths between all pairs of nouns in a group with NLTK, WordNet, and similarity?


I am trying to calculate shortest paths between all pairs of nouns in a group. I have many such groups of nouns, with different group sizes. The biggest group contains about 250 nouns. The input is a txt file with nouns, each on a new line. The output, also a txt file, should list all pairs of nouns with their corresponding shortest paths.

I am new to Python and NLTK, and after a lot of searching here and in other sources, and much trial and error, this is the code I came up with:

import nltk
from nltk.corpus import wordnet as wn

listSim = []
with open("words-input.txt", "rU") as wordList1:
    myList1 = [line.rstrip('\n') for line in wordList1]
    for word1 in myList1:
        with open("words-input2.txt", "rU") as wordList2:
            myList2 = [line.rstrip('\n') for line in wordList2]
            for word2 in myList2:
                wordFromList1 = wn.synsets(word1)
                wordFromList2 = wn.synsets(word2)
                if wordFromList1 and wordFromList2:
                    s = 1/(wordFromList1[0].path_similarity(wordFromList2[0]))
                    sym = (word1, word2, s)
                    listSim.append(sym)

print (listSim)
with open("words-output.txt", "w") as text_file:
    print (listSim, file=text_file)

(Just to note, I could not successfully iterate the same txt file, so I made a duplicate, and the ‘words-input.txt’ and ‘words-input2.txt’ in the above code contain the same group of nouns in the same order.)

The problem with my code is that it only calculates the shortest path between the first synsets (first meaning, n#1) of the two nouns. I need the shortest path over all of their senses. For example, if the shortest path is between n#3 of noun1 and n#5 of noun2, that is the number I have to output (or its reciprocal, representing the number of steps on this path).

Any help or advice on how to do this will be greatly appreciated.


Solution

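  • The following should work for you. I only show the relevant parts; it compares every sense of one word with every sense of the other and keeps the highest similarity, which corresponds to the shortest path.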

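    from itertools import product
    
    for word1 in myList1:
        for word2 in myList2:
            # Restrict the lookup to noun senses, since the inputs are nouns.
            list1 = wn.synsets(word1, pos=wn.NOUN)
            list2 = wn.synsets(word2, pos=wn.NOUN)
    
            # Similarity for every pair of senses; path_similarity() returns
            # None when no path exists, so drop those before comparing.
            sList = [ss1.path_similarity(ss2) for ss1, ss2 in product(list1, list2)]
            sList = [s for s in sList if s is not None]
    
            # An empty list means a word is not in WordNet; skip that pair.
            if sList:
                best = max(sList)  # highest similarity = shortest path
                listSim.append((word1, word2, best))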
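Putting the pieces together, here is a minimal end-to-end sketch, not a definitive implementation. It assumes the WordNet corpus is already downloaded and reuses the file names from the question. The helper names (`noun_synsets`, `synset_distance`, `min_sense_distance`, `all_pairs`, `main`) are made up for illustration. It uses `Synset.shortest_path_distance()`, which returns the number of edges directly. Since `path_similarity` is `1 / (distance + 1)`, the `1 / path_similarity` value from the question is this distance plus one. Reading the file into a list once and using `itertools.combinations` also removes the need for a duplicate input file.

```python
from itertools import combinations, product


def noun_synsets(word):
    """Return the noun synsets of `word` (needs the NLTK WordNet corpus)."""
    # Imported lazily so the helpers below can be used without NLTK installed.
    from nltk.corpus import wordnet as wn
    return wn.synsets(word, pos=wn.NOUN)


def synset_distance(ss1, ss2):
    """Number of edges on the shortest path, or None if no path exists."""
    return ss1.shortest_path_distance(ss2)


def min_sense_distance(senses1, senses2, distance=synset_distance):
    """Smallest distance over all sense pairs; None if no pair is connected."""
    distances = (distance(s1, s2) for s1, s2 in product(senses1, senses2))
    return min((d for d in distances if d is not None), default=None)


def all_pairs(words, senses_of=noun_synsets, distance=synset_distance):
    """Yield (word1, word2, steps) for every unordered pair of words."""
    senses = {w: senses_of(w) for w in words}  # look each word up only once
    for w1, w2 in combinations(words, 2):
        steps = min_sense_distance(senses[w1], senses[w2], distance)
        if steps is not None:  # skip words missing from WordNet
            yield w1, w2, steps


def main(in_path="words-input.txt", out_path="words-output.txt"):
    # Read the file once; a list can be iterated as often as needed.
    with open(in_path) as f:
        words = [line.strip() for line in f if line.strip()]
    with open(out_path, "w") as out:
        for w1, w2, steps in all_pairs(words):
            out.write(f"{w1}\t{w2}\t{steps}\n")
```

Calling `main()` writes one tab-separated line per pair. For a group of 250 nouns that is 31,125 pairs, each comparing every sense combination, so expect the run to take a while.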