Tags: python, nlp, nltk

Getting all leaf words (reverse stemming) into one Python List


Along the same lines as the solution provided in this link, I am trying to get all leaf words of a stem word. I am using the community-contributed package get_word_forms (by @Divyanshu Srivastava).

Imagine I have a shorter sample word list as follows:

my_list = [' jail', ' belief',' board',' target', ' challenge', ' command']

If I do it manually, I have to go word by word, which is very time-consuming for a list of 200 words:

get_word_forms("command")

and get the following output:

{'n': {'command',
  'commandant',
  'commandants',
  'commander',
  'commanders',
  'commandership',
  'commanderships',
  'commandment',
  'commandments',
  'commands'},
 'a': set(),
 'v': {'command', 'commanded', 'commanding', 'commands'},
 'r': set()}

'n' is noun, 'a' is adjective, 'v' is verb, and 'r' is adverb.

If I try to reverse-stem the entire list in one go:

[get_word_forms(word) for word in sample]

every lookup comes back with empty sets:

[{'n': set(), 'a': set(), 'v': set(), 'r': set()},
 {'n': set(), 'a': set(), 'v': set(), 'r': set()},
 {'n': set(), 'a': set(), 'v': set(), 'r': set()},
 {'n': set(), 'a': set(), 'v': set(), 'r': set()},
 {'n': set(), 'a': set(), 'v': set(), 'r': set()},
 {'n': set(), 'a': set(), 'v': set(), 'r': set()},
 {'n': set(), 'a': set(), 'v': set(), 'r': set()}]

I think I am failing at saving the output to the dictionary. Ultimately, I would like the output to be a single flat list, not broken down into noun, adjective, adverb, or verb.

Something like:

['command', 'commandant', 'commandants', 'commander', 'commanders', 'commandership',
 'commanderships', 'commandment', 'commandments', 'commands', 'commanded', 'commanding',
 'commands', 'jail', 'jailer', 'jailers', 'jailor', 'jailors', 'jails', 'jailed', 'jailing', ...]

and so on.

Solution

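  • The empty sets most likely come from the leading space in each string (' command' is not the same lookup key as 'command'). One solution is to strip those forgotten spaces and then use nested list comprehensions: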

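    # Import path as documented for the word_forms package
    from word_forms.word_forms import get_word_forms

    # Strip stray whitespace, then keep every non-empty set of forms
    all_words = [setx for word in my_list for setx in get_word_forms(word.strip()).values() if setx]
    
    # Flatten the list of sets
    all_words = [word for setx in all_words for word in setx]
    
    # Remove the repetitions and sort the set
    all_words = sorted(set(all_words))
    print(all_words)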
    
    ['belief', 'beliefs', 'believabilities', 'believability', 'believable', 'believably', 'believe', 'believed', 'believer', 'believers', 'believes', 'believing', 'board', 'boarded', 'boarder', 'boarders', 'boarding', 'boards', 'challenge', 'challengeable', 'challenged', 'challenger', 'challengers', 'challenges', 'challenging', 'command', 'commandant', 'commandants', 'commanded', 'commander', 'commanders', 'commandership', 'commanderships', 'commanding', 'commandment', 'commandments', 'commands', 'jail', 'jailed', 'jailer', 'jailers', 'jailing', 'jailor', 'jailors', 'jails', 'target', 'targeted', 'targeting', 'targets']
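
  • The same idea can be wrapped in a small helper. This is only a sketch: flatten_word_forms and fake_lookup are hypothetical names, not part of any package. The lookup function is passed in as an argument, so the example runs without the package installed; in real use you would pass get_word_forms instead of the stand-in.

```python
from itertools import chain


def flatten_word_forms(words, lookup):
    """Return a sorted, de-duplicated list of every form of every word.

    `lookup` is any callable mapping a word to a dict of sets,
    for example get_word_forms (assumption: same return shape as
    shown in the question).
    """
    forms = set()
    for word in words:
        # strip() removes stray whitespace that would make the lookup miss
        forms.update(chain.from_iterable(lookup(word.strip()).values()))
    return sorted(forms)


def fake_lookup(word):
    """Hypothetical stand-in for get_word_forms with a tiny hard-coded table."""
    empty = {"n": set(), "a": set(), "v": set(), "r": set()}
    table = {
        "command": {
            "n": {"command", "commands"},
            "a": set(),
            "v": {"command", "commanded"},
            "r": set(),
        },
    }
    return table.get(word, empty)


result = flatten_word_forms([" command", " jail"], fake_lookup)
print(result)
```

    Collecting into a set from the start avoids building the intermediate list of sets, and unknown words simply contribute nothing.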