Tags: python, nlp, nltk, pos-tagger, word-sense-disambiguation

I am having problems doing Word Sense Disambiguation in Python using the Lesk algorithm


I am new to Python and NLTK so please bear with me. I wish to find the sense of a word in the context of a sentence. I am using the Lesk WSD algorithm, but it gives different outputs every time I run it. I know that Lesk has some level of inaccuracy, but I think supplying a POS tag will increase its accuracy.

The lesk function takes a POS tag as an argument, but it expects tags such as 'n', 's', 'v' and not 'NN', 'VBP' or the other POS tags output by the pos_tag() function. I would like to know how to tag words in the 'n', 's', 'v' form, or whether there is a way to convert 'NN', 'VBP' and the other tags into 'n', 's', 'v', so that I can pass them to the lesk(context_sentence, word, pos_tag) function.

I am calculating the sentiment score of every word using SentiWordNet afterwards.

    import re

    from nltk import word_tokenize
    from nltk.corpus import sentiwordnet as swn
    from nltk.wsd import lesk

    def word_sense():
        sent = word_tokenize("He should be happy.")
        word = "be"
        pos = "v"
        score = lesk(sent, word, pos)
        print(score)
        print(str(score), type(score))
        # Extract the quoted synset name from the string form of the result
        set1 = re.findall("'([^']*)'", str(score))[0]
        print(set1)
        bank = swn.senti_synset(str(set1))
        print(bank)

    word_sense()

Solution

  • nltk.wsd.lesk does not return a score; it returns the predicted Synset:

    >>> from nltk.corpus import wordnet as wn
    >>> from nltk.corpus import sentiwordnet as swn
    >>> from nltk import word_tokenize
    >>> from nltk.wsd import lesk
    >>> sent = word_tokenize("He should be happy".lower())
    >>> lesk(sent, 'be', 'v')
    Synset('equal.v.01')
    

    lesk is not perfect; it should only be used as a baseline system for WSD.

    Although this is nice:

    >>> import re
    >>> ss = str(lesk(sent, 'be', 'v'))
    >>> re.findall("'([^']*)'", ss)
    ['equal.v.01']
    

    There's a simpler way to get the synset identifier:

    >>> lesk(sent, 'be', 'v').name()
    u'equal.v.01'
    

    Then you can do:

    >>> swn.senti_synset(lesk(sent, 'be', 'v').name())
    SentiSynset('equal.v.01')
    

    To convert the Penn Treebank tags returned by pos_tag() into WordNet POS tags, see: Converting POS tags from TextBlob into Wordnet compatible inputs
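
For the tag conversion asked about in the question, a minimal sketch might look like the following. Note that `penn_to_wn` is a hypothetical helper name, not something NLTK provides, and the prefix-based mapping is an assumption that covers only the four broad classes (noun, verb, adjective, adverb). The tagged tokens are hard-coded so the snippet runs without any NLTK data; in practice they would come from `nltk.pos_tag(tokens)`.

```python
# Hypothetical helper (not part of NLTK): map Penn Treebank tags, as produced
# by nltk.pos_tag(), to the single-letter WordNet POS codes that lesk() accepts.
def penn_to_wn(tag):
    """Return 'n', 'v', 'a' or 'r' for a Penn Treebank tag, else None."""
    if tag.startswith('NN'):
        return 'n'   # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'v'   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('JJ'):
        return 'a'   # adjectives: JJ, JJR, JJS
    if tag.startswith('RB'):
        return 'r'   # adverbs: RB, RBR, RBS
    return None      # no WordNet equivalent (e.g. DT, IN, PRP, MD)


# Example tags are hard-coded (an assumption for illustration);
# normally they would be the output of nltk.pos_tag(tokens).
tagged = [('He', 'PRP'), ('should', 'MD'), ('be', 'VB'), ('happy', 'JJ')]
converted = [(word, penn_to_wn(tag)) for word, tag in tagged]
print(converted)
```

The converted tag can then be passed as the third argument, e.g. `lesk(sent, 'be', penn_to_wn('VB'))`. When the helper returns None, the POS argument can simply be left out, since `pos` is optional in `lesk`. One caveat: WordNet marks some adjective senses as 's' (satellite adjectives), so filtering with 'a' alone may exclude those senses; it is worth checking the result for adjectives.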