
Is it possible to get classes on the WordNet dataset?


I am experimenting with WordNet while trying to solve an NLP task.

I was wondering whether there is any way to get a list of words belonging to some large category, such as "animals" (e.g. dog, cat, cow), "countries", "electronics", etc.

I believe that it should be possible to somehow get this list by exploiting hypernyms.

Bonus question: do you know any other way to classify words into very broad classes besides "noun", "adjective" and "verb"? For example, classes like "prepositions", "conjunctions", etc.


Solution

  • With some help from polm23, I found this solution, which exploits similarity between words and avoids wrong results when the class name is ambiguous. The idea is to use WordNet to compare each word in a list against one specific sense of "animal" (the synset animal.n.01) and compute a similarity score. From the nltk.org webpage:

    Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

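    from nltk.corpus import wordnet as wn

    def keep_similar(words, similarity_thr):
        # Reference sense: "animal" as a living organism.
        w2 = wn.synset('animal.n.01')
        # Keep words whose first noun sense scores above the threshold.
        # Note: wn.synset raises an error if a word has no '.n.01' synset.
        return [word for word in words
                if wn.synset(word + '.n.01').wup_similarity(w2) > similarity_thr]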
    

    For example, if word_list = ['dog', 'car', 'train', 'dinosaur', 'London', 'cheese', 'radon'], the corresponding scores (in list order) are:

    0.875
    0.4444444444444444
    0.5
    0.7
    0.3333333333333333
    0.3076923076923077
    0.3076923076923077
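
Pairing each score with its word makes the cut-off easier to pick. The snippet below is only a sketch that reuses the word list and the scores quoted above; the names `scores` and `above_threshold` are made up for illustration and are not part of NLTK.

```python
word_list = ['dog', 'car', 'train', 'dinosaur', 'London', 'cheese', 'radon']
# Wu-Palmer scores against animal.n.01, copied from the output above.
values = [0.875, 0.4444444444444444, 0.5, 0.7,
          0.3333333333333333, 0.3076923076923077, 0.3076923076923077]
scores = dict(zip(word_list, values))


def above_threshold(scores, similarity_thr):
    """Return the words whose score is strictly greater than the threshold."""
    return [word for word, score in scores.items() if score > similarity_thr]


# Print the surviving words for a few candidate thresholds.
for thr in (0.4, 0.5, 0.7):
    print(thr, above_threshold(scores, thr))
```

Because the comparison is strict, a word whose score equals the threshold is dropped: at 0.5 "train" is excluded, and at 0.7 "dinosaur" is excluded too.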
    

    This can easily be used to generate a list of animals by choosing a suitable value of similarity_thr. For the scores above, 0.6 keeps dog and dinosaur and drops everything else.
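
The hypernym idea from the question can also be tried directly, without a threshold. What follows is a rough sketch rather than a tested recipe: it walks up the `hypernyms()` and `instance_hypernyms()` links that NLTK synsets expose and checks whether the class synset is among the ancestors. The helper names (`hypernym_closure`, `belongs_to`, `wordnet_noun_senses`) are my own, and I am assuming that accepting any noun sense of a word is good enough, which may over-include ambiguous words.

```python
def hypernym_closure(synset):
    """Collect every ancestor reachable via hypernym / instance-hypernym links."""
    seen = set()
    stack = [synset]
    while stack:
        current = stack.pop()
        for parent in current.hypernyms() + current.instance_hypernyms():
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def belongs_to(word, class_synset, lookup):
    """True if any sense returned by lookup(word) is, or descends from, class_synset."""
    return any(sense == class_synset or class_synset in hypernym_closure(sense)
               for sense in lookup(word))


def wordnet_noun_senses(word):
    """Look up all noun synsets of a word in WordNet."""
    # Imported lazily so the helpers above can be read and tried without NLTK.
    from nltk.corpus import wordnet as wn
    return wn.synsets(word, pos=wn.NOUN)


# Possible usage (needs NLTK plus the WordNet corpus downloaded):
#   from nltk.corpus import wordnet as wn
#   animal = wn.synset('animal.n.01')
#   animals = [w for w in word_list if belongs_to(w, animal, wordnet_noun_senses)]
```

Compared with the similarity score, this gives a yes/no answer per word, so there is no threshold to tune, but the result depends entirely on which synset is picked as the class.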