
WordNet: Iterate over synsets


For a project I would like to measure the number of ‘human centered’ words within a text. I plan on doing this using WordNet. I have never used it and I am not quite sure how to approach this task. I want to use WordNet to count the number of words that belong to certain synsets, for example the synsets ‘human’ and ‘person’.

I came up with the following (simple) piece of code:

from nltk.corpus import wordnet as wn

word = 'girlfriend'
word_synsets = wn.synsets(word)[0]

hypernyms = word_synsets.hypernym_paths()[0]

for element in hypernyms:
    print(element)

Results in:

Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('causal_agent.n.01')
Synset('person.n.01')
Synset('friend.n.01')
Synset('girlfriend.n.01')

My first question is, how do I properly iterate over the hypernyms? In the code above it prints them just fine. However, when using an ‘if’ statement, for example:

count_humancenteredness = 0
for element in hypernyms:
    if element == 'person':
        print('found person hypernym')
        count_humancenteredness += 1

I get ‘AttributeError: 'str' object has no attribute '_name'’. What method can I use to iterate over the hypernyms of my word and perform an action (e.g. increase the count of human centeredness) when a word does indeed belong to the ‘person’ or ‘human’ synset?

Secondly, is this an efficient approach? I assume that iterating over several texts and iterating over the hypernyms of each noun will take quite some time. Perhaps there is another way to use WordNet to perform my task more efficiently.

Thanks for your help!


Solution

  • wrt the error message

    hypernyms = word_synsets.hypernym_paths()[0] is a list of Synset objects (hypernym_paths() itself returns a list of such paths).

    Hence

    if element == 'person':
    

    tries to compare a Synset object against a string. That kind of comparison is not supported by Synset.

    Try something like

    target_synsets = wn.synsets('person')
    if element in target_synsets:
        ...
    

    or

    if 'person' in element.lemma_names():
        ...
        ...
    

    instead.

  • wrt efficiency

    Currently, you do a hypernym-lookup for every word inside your input text. As you note, this is not necessarily efficient. However, if this is fast enough, stop here and do not optimize what is not broken.

    To speed up the lookup, you can pre-compile a list of "person related" words in advance by taking the transitive closure over the hyponyms of the person synset.

    Something like

    person = wn.synset('person.n.01')
    person_words = set(w for s in person.closure(lambda s: s.hyponyms())
                       for w in s.lemma_names())
    

    should do the trick. This will return a set of ~ 10,000 words, which is not too much to store in main memory.

    A simple version of the word counter then becomes something along the lines of

    from collections import Counter

    word_count = Counter()
    for word in (w.lower() for w in words if w.lower() in person_words):
        word_count[word] += 1
    

    You might also need to pre-process the input words using stemming or other morphological reductions before passing the words on to WordNet, though.