I am working on an IR project and need an alternative to both stemming (which can return non-words) and lemmatization (which may not change the word at all).
So I looked for a way to get the related forms of a word.
This Python script gives me the derivationally related forms of a word (e.g. "retrieving"), using NLTK and WordNet:
from nltk.corpus import wordnet as wn
word = "retrieving"
forms = set()
for synset in wn.synsets(word):
    for lemma in synset.lemmas():
        # collect the derivationally related forms of every lemma in the synset
        for related in lemma.derivationally_related_forms():
            forms.add(related.name())
print(list(forms))
The output is:
['recollection', 'recovery', 'regaining', 'think', 'retrieval', 'remembering', 'recall', 'recollective', 'thought', 'remembrance', 'recoverer', 'retriever']
But what I really want is only 'retrieval' and 'retriever', not 'think' or 'recovery', etc.
The result is also missing other forms, such as 'retrieve'.
I know the problem is that the synsets include lemmas other than my input word, so I get derived forms that are unrelated to it.
Is there a way to get the result I am expecting?
You could do what you currently do, then run a stemmer over the word list you get, and only keep the ones that have the same stem as the word you want.
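A minimal sketch of that filtering step. The function name `filter_by_stem` and the `stem` parameter are made up for illustration; `stem` is assumed to be any callable that maps a word to its stem, so the stemmer you already use can be plugged in.

```python
def filter_by_stem(word, candidates, stem):
    """Keep only the candidates whose stem equals the stem of `word`.

    `stem` is any callable str -> str (an assumption of this sketch).
    """
    target = stem(word.lower())
    # de-duplicate, then compare stems; sort for a stable result
    return sorted(c for c in set(candidates) if stem(c.lower()) == target)
```

With NLTK, `stem` could be `PorterStemmer().stem` from `nltk.stem`, and `candidates` the set of names collected from `derivationally_related_forms()`. Note that this only removes unrelated words; it cannot add forms such as 'retrieve' that WordNet did not return in the first place.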
Another approach, not using WordNet, is to get a large dictionary that contains all derived forms, then do a fuzzy search on it. I just found this: https://github.com/dwyl/english-words/ (which links back to the question "How to get english language word database?").
The simplest algorithm would be an O(N) linear search, computing the Levenshtein distance between your word and each entry. Alternatively, run your stemmer on each entry and keep the ones whose stem matches.
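The linear search could look roughly like this. It is a sketch using only the standard library; the names `levenshtein` and `fuzzy_search` and the `max_distance` threshold of 3 are assumptions for illustration, not values taken from any library.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_search(word, dictionary, max_distance=3):
    """O(N) scan: return the entries within max_distance edits of word."""
    return [entry for entry in dictionary
            if levenshtein(word, entry) <= max_distance]
```

Edit distance knows nothing about morphology, so a threshold loose enough to catch 'retrieve' (3 edits from 'retrieving') will also let through unrelated words such as 'relieving' (2 edits). You would probably want to combine it with the stem check.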
If efficiency starts to be a concern... well, that is really a new question, but the first idea that comes to mind is you could do a one-off indexing of all entries by the stemmer result.