
How can I print out the main lemma of a WordNet synset? Python NLTK


I have a large set of WordNet synsets. A small portion of this set is:

syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}

I want to print out the synset term (the main lemma of the synset) for each synset in the set. For example, the output of the above set should be:

brutal, benignant

This is the code I used:

    from nltk.corpus import wordnet as wn
    for s in syns:
        print(wn.s.lemmas[0])

but this does not work, because s is considered a string, and not an object. I get the following error:

AttributeError: 'WordNetCorpusReader' object has no attribute 's'

Since s is a string rather than a Synset object, I tried converting it to bytes like so:

    s = bytes(s)

But that does not work. How can I print out only the lemma as mentioned above, in the simplest way?

I checked here, and this is a good way to do it, but my synsets are in string form, and not actually Synset objects.

Thanks in advance.


Solution

  • TL;DR

    >>> syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}
    >>> [wn.synset(i[8:-2]) for i in syns]
    [Synset('benignant.s.02'), Synset('brutal.s.04')]
    >>> syns = [wn.synset(i[8:-2]) for i in syns]
    >>> syns[0].lemma_names()
    [u'benignant', u'gracious']
    

    First, it is unusual to receive input in which the printed representation of each object has been stored as a string. The intuitive first approach is to evaluate each string with eval() so that it resolves to the Synset type, https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L305 (ast.literal_eval() is not an option here, because it only accepts literals and rejects a constructor call; and before using eval(), see http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html):

    >>> from nltk.corpus.reader.wordnet import Synset
    >>> from nltk.corpus import wordnet as wn
    >>> syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}
    >>> [eval(i) for i in syns]
    [Synset('None'), Synset('None')]
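As a side note on the safer alternative to eval(), here is a minimal sketch using only the standard library. It shows that ast.literal_eval() refuses a string such as "Synset('brutal.s.04')", because that string is a call expression and not a literal. The variable names text and outcome are only for illustration.

```python
import ast

# The stored string is a call expression, not a Python literal.
text = "Synset('brutal.s.04')"

try:
    ast.literal_eval(text)
    outcome = "evaluated"
except ValueError:
    # literal_eval only accepts literals (strings, numbers, tuples, lists,
    # dicts, sets, booleans, None), so a constructor call is rejected.
    outcome = "rejected"

print(outcome)  # prints "rejected"
```

So literal_eval() is safe but cannot rebuild the object, which is why the name has to be pulled out of the string in some other way.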
    

    Evidently, the Synset class does not work independently of nltk.corpus.wordnet: constructing one directly yields a Synset with no name attached. So we take a look at the wordnet.synset() function instead (https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1217). It appears to take only the name of a Synset object, so:

    >>> wn.synset('brutal.s.04')
    Synset('brutal.s.04')
    >>> type(wn.synset('brutal.s.04'))
    <class 'nltk.corpus.reader.wordnet.Synset'>
    

    Once each pseudo-synset string in your input syns has been converted into a real Synset, you can work with it as shown in "How do I print out just the word itself in a WordNet synset using Python NLTK?"

    Back to your unusual input syns: slicing off the leading Synset(' and the trailing ') gives the name of the synset:

    >>> syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}
    >>> list(syns)[0]
    "Synset('benignant.s.02')"
    >>> list(syns)[0][8:-2]
    'benignant.s.02'
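The [8:-2] slice above silently returns garbage if a string is not exactly in the Synset('...') form. A slightly more defensive variant, sketched here with a regular expression, fails loudly instead. The helper name synset_name and the pattern are assumptions made for illustration; they are not part of NLTK.

```python
import re

# Matches the printed form Synset('name') or Synset("name").
# Hypothetical pattern, written for this example only.
_SYNSET_REPR = re.compile(r"""^Synset\((['"])(?P<name>.+)\1\)$""")


def synset_name(text):
    """Extract 'brutal.s.04' from "Synset('brutal.s.04')" or raise ValueError."""
    match = _SYNSET_REPR.match(text.strip())
    if match is None:
        raise ValueError("not a Synset representation: %r" % (text,))
    return match.group("name")


syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}

# Sets are unordered, so sort to get a stable result.
names = sorted(synset_name(s) for s in syns)
print(names)  # ['benignant.s.02', 'brutal.s.04']
```

Each extracted name can then be passed to wn.synset() exactly as the sliced string would be.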
    

    So back to converting it into a Synset:

    >>> syns = {"Synset('brutal.s.04')", "Synset('benignant.s.02')"}
    >>> [wn.synset(i[8:-2]) for i in syns]
    [Synset('benignant.s.02'), Synset('brutal.s.04')]
    >>> syns = [wn.synset(i[8:-2]) for i in syns]
    >>> syns[0].lemma_names()
    [u'benignant', u'gracious']
    

    But let's step back: you are getting this odd input syns because someone saved their output by simply calling str() on each Synset object:

    >>> syns[0]
    Synset('benignant.s.02')
    >>> str(syns[0])
    "Synset('benignant.s.02')"
    

    Instead, the person could simply have saved the name of each synset:

    >>> syns[0].name()
    u'benignant.s.02'
    

    Then your input syns object would look like this:

    syns = {u'brutal.s.04', u'benignant.s.02'}
    

    and to read it, you can simply do:

    >>> from nltk.corpus import wordnet as wn
    >>> syns = {u'brutal.s.04', u'benignant.s.02'}
    >>> syns = [wn.synset(i) for i in syns]
    >>> syns[0]
    Synset('brutal.s.04')
    >>> syns[0].lemma_names()
    [u'brutal']
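To get the comma-separated output asked for in the question, the first entry of lemma_names() can be taken for each synset. This is a minimal sketch: the helper names main_lemmas and format_main_lemmas are hypothetical, and treating the first lemma as the "main" one is an assumption based on the outputs shown above.

```python
def main_lemmas(synsets):
    """Return the first lemma name of each synset, in iteration order.

    Works with any objects exposing lemma_names(), such as NLTK Synsets.
    """
    return [s.lemma_names()[0] for s in synsets]


def format_main_lemmas(synsets):
    """Join the first lemma names into one comma-separated string."""
    return ", ".join(main_lemmas(synsets))
```

With WordNet loaded, print(format_main_lemmas([wn.synset('brutal.s.04'), wn.synset('benignant.s.02')])) should give brutal, benignant, going by the lemma_names() outputs shown earlier. A list is used here rather than a set so that the order of the output is predictable.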