nlp nltk text-mining stemming

Get the word from stem (stemming)


I am using the Porter stemmer as follows to get the stems of my words.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    # Return the stem of each token, preserving the input order.
    return [stemmer.stem(item) for item in tokens]

Now I want to recover a plausible word from each stem to make the output readable. For example, environ to environment, or educ to education. Is this possible?


Solution

  • So you want to take a stem and map it to the list of dictionary words that stem back to it?

    This is difficult because stemming is lossy and not a 1:1 transformation: several different words can collapse to the same stem, so a stem alone does not tell you which word it came from.

    That said, in some cases you can get by with a trie structure where you do a prefix lookup: environ returns candidates such as {environment, environments, environmental}, and educ returns {educate, educational, education, educated, educating}. Things get more interesting for stems like happi, which has to map back to happy. A prefix lookup cannot find that, because happy does not start with happi.

    In the general case, you would have to start with a dictionary and build an inverted index: stem each word, then store the word under its stem. Given a stem, you can then look up all the source words that produce it.

    Hope this helps.
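
The inverted-index idea from the answer above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function names `build_stem_index`, `unstem`, and `prefix_candidates` are hypothetical, `stem` is assumed to be any callable that maps a word to its stem (for example `PorterStemmer().stem`), and the vocabulary is whatever word list you already have. The prefix lookup uses a plain `startswith` scan as a stand-in for a trie, just to show where it falls short.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def build_stem_index(words: Iterable[str],
                     stem: Callable[[str], str]) -> Dict[str, List[str]]:
    """Map each stem to the sorted list of vocabulary words that produce it."""
    index = defaultdict(set)
    for word in words:
        # Stem every vocabulary word once and file it under its stem.
        index[stem(word)].add(word)
    return {s: sorted(ws) for s, ws in index.items()}


def unstem(stem_value: str, index: Dict[str, List[str]]) -> List[str]:
    """Return every known word for a stem, or an empty list if none is known."""
    return index.get(stem_value, [])


def prefix_candidates(stem_value: str, words: Iterable[str]) -> List[str]:
    """Simpler alternative (stand-in for a trie): words that start with the stem.

    This misses words whose spelling diverges from the stem, such as
    happy for the stem happi.
    """
    return sorted(w for w in set(words) if w.startswith(stem_value))
```

With NLTK, you would build the index once with something like `build_stem_index(vocabulary, PorterStemmer().stem)` and then call `unstem("educ", index)`. The result is a list of candidates, so you still need a rule for choosing one word to display, for example the shortest candidate or the most frequent one in your corpus.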