python, python-3.x, wordnet, text-classification

Reducing the output of Wordnet to one meaning


First off, let me introduce my problem: for a project I have to classify 8000 questions into 7 categories (constitution, sports, geography, history, science, education and tech). Because the questions are very short, SVMs don't make much sense, so I just created a list of words for every category. To improve accuracy I have to expand these lists so that more unlabeled questions can be matched to a category. I read online that WordNet can supply synonyms of words, which suits me because I need as many synonyms for my words as possible. But here comes the problem. With the following code, WordNet returns

from nltk.corpus import wordnet as wn

word = "capital"  # example word to look up
# Print every lemma of every synset (sense) of the word
for synset in wn.synsets(word):
    for lemma in synset.lemmas():
        print(lemma.name())

all the related words, across every sense of the word. An example is the word capital: I only mean capital in the sense of the capital city of a country, but WordNet returns the words capital, working, capital letter, upper case, upper-case, majuscule and Capital Washington. Obviously, I don't need the word upper-case in a bag of words for geography. So my question is: is there any way to restrict WordNet to a single meaning, or is there an alternative that I can use?

Sincerely, James


Solution

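  • You need to find the synonyms for one specific sense rather than for the bare word. In WordNet, each synset groups the lemmas (canonical dictionary forms) that share a single definition, so pick the synset you mean and take only its lemmas. I'll simply include the link I posted in the comments, and wish you good luck.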
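
One way to do this is to choose a synset by looking at its definition and then read only that synset's lemma names. The following is a minimal sketch and not the method from the linked page. The helper names `pick_sense` and `synonyms_for_sense` are made up for illustration, and matching on keywords in the definition is an assumption about how you might identify the right sense.

```python
def pick_sense(synsets, keywords):
    """Return the first synset whose definition mentions any keyword, else None.

    Hypothetical helper: works on any objects exposing .definition(),
    such as the synsets returned by nltk.corpus.wordnet.synsets().
    """
    keywords = [k.lower() for k in keywords]
    for synset in synsets:
        definition = synset.definition().lower()
        if any(k in definition for k in keywords):
            return synset
    return None


def synonyms_for_sense(word, keywords):
    """Synonyms of `word`, restricted to the noun sense selected by `keywords`."""
    # Imported here so pick_sense can be used without NLTK installed.
    # The WordNet data must be downloaded once with nltk.download("wordnet").
    from nltk.corpus import wordnet as wn

    synset = pick_sense(wn.synsets(word, pos=wn.NOUN), keywords)
    if synset is None:
        return []
    # WordNet joins multi-word lemmas with underscores, e.g. "capital_letter".
    return sorted({name.replace("_", " ") for name in synset.lemma_names()})
```

For the example in the question, a call such as `synonyms_for_sense("capital", ["seat of government"])` should keep only the capital-city sense, assuming that sense's definition contains that phrase. Printing `synset.name()` and `synset.definition()` for every result of `wn.synsets("capital")` shows which senses exist and which keywords separate them. If you have a context sentence instead of keywords, `nltk.wsd.lesk` is an alternative that tries to pick the sense automatically, although its choices are not always the one you intend.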