
Wordnet: Getting derivationally_related_forms of a word


I am working on an IR project, and I need an alternative to both stemming (which returns non-words) and lemmatization (which may not change the word at all).

So I looked for a way to get forms of a word.

This Python script gives me the derivationally_related_forms of a word (e.g. "retrieving"), using NLTK and WordNet:

from nltk.corpus import wordnet as wn

word = "retrieving"

# Collect the derivationally related forms of every lemma
# in every synset that matches the input word.
forms = set()
for synset in wn.synsets(word):
    for lemma in synset.lemmas():
        for related_form in lemma.derivationally_related_forms():
            forms.add(related_form.name())

print(list(forms))

The output is:

['recollection', 'recovery', 'regaining', 'think', 'retrieval', 'remembering', 'recall', 'recollective', 'thought', 'remembrance', 'recoverer', 'retriever']

But what I really want is only 'retrieval' and 'retriever', not 'think' or 'recovery', etc.

The result is also missing other forms, such as 'retrieve'.

I know the problem is that the synsets include words other than my input word, so I get unrelated derived forms.

Is there a way to get the result I am expecting?


Solution

  • You could do what you currently do, then run a stemmer over the word list you get, and only keep the ones that have the same stem as the word you want.

    Another approach, not using WordNet, is to get a large dictionary that contains all derived forms, then do a fuzzy search on it. I just found this: https://github.com/dwyl/english-words/ (which links back to this question: How to get english language word database?)

    The simplest algorithm would be an O(N) linear scan, computing the Levenshtein distance to each entry, or running your stemmer on each entry.

    If efficiency starts to be a concern, that is really a new question, but the first idea that comes to mind is a one-off index of all entries, keyed by the stemmer result.
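    The first suggestion can be sketched as follows. This is a minimal sketch using NLTK's PorterStemmer, applied to the candidate list produced by the question's script (in practice you would feed the WordNet output in directly); any stemmer with the same interface would work:

    ```python
    from nltk.stem import PorterStemmer

    # Candidate derived forms, e.g. the output of the WordNet script above.
    candidates = ['recollection', 'recovery', 'regaining', 'think', 'retrieval',
                  'remembering', 'recall', 'recollective', 'thought',
                  'remembrance', 'recoverer', 'retriever']

    stemmer = PorterStemmer()
    target_stem = stemmer.stem("retrieving")  # "retriev"

    # Keep only the candidates whose stem matches the input word's stem.
    related = sorted(w for w in candidates if stemmer.stem(w) == target_stem)
    print(related)  # ['retrieval', 'retriever']
    ```

    The stemmer never has to produce a real word here: its output is used only as a grouping key, so the "unreal words" drawback of stemming does not matter for this filtering step.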