
WordNet: Iterate over synsets


For a project I would like to measure the number of ‘human centered’ words within a text. I plan on doing this using WordNet. I have never used it and I am not quite sure how to approach this task. I want to use WordNet to count the number of words that belong to certain synsets, for example the synsets ‘human’ and ‘person’.

I came up with the following (simple) piece of code:

from nltk.corpus import wordnet as wn

word = 'girlfriend'
word_synsets = wn.synsets(word)[0]

hypernyms = word_synsets.hypernym_paths()[0]

for element in hypernyms:
    print(element)

Results in:

Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('causal_agent.n.01')
Synset('person.n.01')
Synset('friend.n.01')
Synset('girlfriend.n.01')

My first question is, how do I properly iterate over the hypernyms? In the code above it prints them just fine. However, when using an ‘if’ statement, for example:

count_humancenteredness = 0
for element in hypernyms:
    if element == 'person':
        print('found person hypernym')
        count_humancenteredness += 1

I get ‘AttributeError: 'str' object has no attribute '_name'’. What method can I use to iterate over the hypernyms of my word and perform an action (e.g. increase the count of human centeredness) when a word does indeed belong to the ‘person’ or ‘human’ synset?

Secondly, is this an efficient approach? I assume that iterating over several texts and iterating over the hypernyms of each noun will take quite some time. Perhaps there is another way to use WordNet to perform my task more efficiently.

Thanks for your help!


Solution

  • wrt the error message

    hypernyms = word_synsets.hypernym_paths()[0] is a list of Synset objects (hypernym_paths() itself returns a list of such paths).

    Hence

    if element == 'person':
    

    tries to compare a Synset object against a string. That kind of comparison is not supported by Synset.

    Try something like

    target_synsets = wn.synsets('person')
    if element in target_synsets:
        ...
    

    or

    if 'person' in element.lemma_names():
        ...
        ...
    

    instead.

  • wrt efficiency

    Currently, you do a hypernym-lookup for every word inside your input text. As you note, this is not necessarily efficient. However, if this is fast enough, stop here and do not optimize what is not broken.

    To speed up the lookup, you can pre-compile a list of "person related" words in advance by taking the transitive closure over the hyponyms of the person synset.

    Something like

    person = wn.synset('person.n.01')
    person_words = set(w for s in person.closure(lambda s: s.hyponyms())
                       for w in s.lemma_names())
    

    should do the trick. This will return a set of ~ 10,000 words, which is not too much to store in main memory.

    A simple version of the word counter then becomes something along the lines of

    from collections import Counter

    word_count = Counter()
    for word in (w.lower() for w in words if w.lower() in person_words):
        word_count[word] += 1
    

    You might also need to pre-process the input words using stemming or other morphological reductions before passing the words on to WordNet, though.