From the semcor corpus (http://www.cse.unt.edu/~rada/downloads.html), there are senses wasn't mapped to the later versions of wordnet. And magically, the mapping can be found in the NLTK WordNet API as such:
>>> from nltk.corpus import wordnet as wn
# Emunerate the possible senses for the lemma 'delayed'
>>> wn.synsets('delayed')
[Synset('delay.v.01'), Synset('delay.v.02'), Synset('stay.v.06'), Synset('check.v.07'), Synset('delayed.s.01')]
>>> wn.synset('delay.v.01')
Synset('delay.v.01')
# Magically, there is a 0th sense of the word!!!
>>> wn.synset('delayed.a.0')
Synset('delayed.s.01')
I've checked the code and the API (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet.Synset-class.html, http://nltk.org/_modules/nltk/corpus/reader/wordnet.html) but i can't find how they did the magically mapping that didn't shouldn't exist (e.g. for delayed.a.0
-> delayed.s.01
).
Does anyone know which part of the NLTK Wordnet API code does the magical mapping?
It's a bug I guess. When you do wn.synset('delayed.a.0')
the first two lines in the method are:
lemma, pos, synset_index_str = name.lower().rsplit('.', 2)
synset_index = int(synset_index_str) - 1
So in this case the value of synset_index
is -1
which is a valid index in python. And it won't fail when looking up in the array of synsets whose lemma
is delayed
and pos
is a
.
With this behavior you can do tricky things like:
>>> wn.synset('delay.v.-1')
Synset('stay.v.06')