I have a list of words. I look up each word in WordNet and select its first synset. That synset displays correctly on my terminal (for example: Synset('prior.n.01')). I then convert it to a string and run a regex replacement on it. The expected output is 'prior.n.01', but what I get is those square boxes with numbers in them. Since the terminal displays the string correctly before the replacement, I'm guessing the terminal itself isn't the problem. So, is there something wrong with this regex? Or is it because I'm using it on a string that was originally a list element?
Here's the code I'm using:
import re
import nltk
from nltk.corpus import wordnet as wn
word_list = ['prior','indication','link','linked','administered','foobar']
for word in word_list:
    synset_list = wn.synsets(word) #returns a list of all synsets for a word
    if synset_list == []: #break if word in list isn't in dictionary (empty list)
        break
    else:
        first_synset = str(synset_list[0]) #returns Synset('prior.n.01') as string
        print first_synset
        clean_synset = re.sub(r'Synset\((.+)\)',r'\1',first_synset) #expected output: 'prior.n.01'
        print clean_synset
There is actually a Synset.name() function to extract the synset name, so you don't need a regex at all:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')[0].name()
u'dog.n.01'
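As a rough sketch, the loop from the question could call name() directly instead of going through str() and a regex. The helper name first_synset_names and its lookup parameter are made up for illustration (they are not part of NLTK); lookup defaults to wn.synsets and exists only so the function can be tried without the WordNet data installed. It also uses continue rather than break, on the assumption that an unknown word should be skipped rather than end the whole loop.

```python
def first_synset_names(words, lookup=None):
    """Map each word to the name of its first synset, skipping unknown words.

    `lookup` is a hypothetical hook: any callable that takes a word and
    returns a list of synset-like objects exposing a .name() method.
    """
    if lookup is None:
        # Imported lazily so the function can be called with a stand-in lookup.
        from nltk.corpus import wordnet as wn
        lookup = wn.synsets
    names = {}
    for word in words:
        synsets = lookup(word)
        if not synsets:
            # `continue` skips only this word; `break` would stop the loop.
            continue
        names[word] = synsets[0].name()
    return names
```

With NLTK and the WordNet corpus installed, first_synset_names(['prior', 'foobar', 'link']) should return entries for the words WordNet knows and leave the unknown one out.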
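Also there's a Synset.unicode_repr(), which returns the representation as a unicode string and is useful to avoid any encoding/byte-string problems. Going back to the regex: the square boxes are what you get when the replacement is written as '\1' without the r prefix, because Python reads that as the single control character \x01 rather than a backreference to group 1. Compare: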
>>> import re
>>> x = wn.synsets('dog')[0].unicode_repr()
>>> re.sub(r'Synset\((.+)\)','\1',x)
u'\x01'
>>> re.sub(r'Synset\((.+)\)','1',x)
u'1'
>>> re.sub(r'Synset\((.+)\)','\\1',x)
u"'dog.n.01'"
>>> re.sub(r"Synset\(\'(.+)\'\)",'\\1',x)
u'dog.n.01'
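The same comparison can be reproduced without NLTK. This is a minimal sketch that assumes the input is the literal string "Synset('dog.n.01')"; the variable names are only for illustration. It shows that a plain '\1' reaches re.sub as the control character chr(1), while a raw string (or '\\1') reaches it as a backreference.

```python
import re

text = "Synset('dog.n.01')"
# Match the quotes outside the group so they are dropped from the result.
pattern = r"Synset\('(.+)'\)"

# In a normal string literal, '\1' is an octal escape for chr(1),
# so re.sub never sees a backslash and inserts that character verbatim.
broken = re.sub(pattern, '\1', text)

# A raw string keeps the backslash, so re.sub treats it as "group 1".
fixed = re.sub(pattern, r'\1', text)

print(repr(broken))  # '\x01'
print(repr(fixed))   # 'dog.n.01'
```

Using a raw string for both the pattern and the replacement is the usual habit, since it avoids having to remember which backslashes Python consumes before the re module sees them.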