Stemming on tokenized words

I have this dataset:

>cleaned['text']
0         [we, have, a, month, open, #postdoc, position,...
1         [the, hardworking, biofuel, producers, in, iow...
2         [the, hardworking, biofuel, producers, in, iow...
3         [in, today, s, time, it, is, imperative, to, r...
4         [special, thanks, to, gaetanos, beach, club, o...
                                ...                        
130736    [demand, gw, sources, fossil, fuels, renewable...
130737         [there, s, just, not, enough, to, go, round]
130738    [the, answer, to, deforestation, lies, in, space]
130739    [d, filament, from, plastic, waste, regrind, o...
130740          [gb, grid, is, generating, gw, out, of, gw]
Name: text, Length: 130741, dtype: object

Is there a simple way to stem all the words?

Solution

You may find better answers, but I personally find the LemmInflect library to be the best for lemmatization and inflection.

#!pip install lemminflect
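from lemminflect import getAllLemmas, getInflection

word = 'testing'
# getAllLemmas returns a dict keyed by part of speech; take the first tuple of lemmas
lemma = list(getAllLemmas(word, upos='NOUN').values())[0]
# getInflection maps a lemma to the form for a Penn Treebank tag (VBD = past tense)
inflect = getInflection(lemma[0], tag='VBD')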

print(word, lemma, inflect)

testing ('test',) ('tested',)

I would avoid stemming because it is rarely useful if you want to work with language models or with text classification that depends on context. Stemming and lemmatization both reduce inflected words to a root form. The difference is that a stem might not be an actual word, whereas a lemma is always a real word in the language.
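To make the stem versus lemma distinction concrete, here is a minimal sketch in plain Python. It is a toy: `toy_stem`, `toy_lemma`, the suffix list, and the lemma table are all made up for illustration and are not how LemmInflect or any real stemmer works.

```python
# Toy illustration only: this is NOT the Porter stemmer or LemmInflect.
SUFFIXES = ("ing", "es", "ed", "s")  # checked in this order

# Hypothetical hand-written lemma table standing in for a real dictionary.
LEMMAS = {"studies": "study", "testing": "test", "was": "be"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["studies", "testing", "was"]:
    print(w, toy_stem(w), toy_lemma(w))
# studies studi study
# testing test test
# was was be
```

The stem 'studi' is not an English word, while the lemma 'study' is. Suffix stripping also cannot relate 'was' to 'be', which a dictionary-based lemmatizer can.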

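Inflection is the reverse of lemmatization: it takes a lemma and produces a specific inflected form, as getInflection does above with tag='VBD'.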

sentence = ['I', 'am', 'testing', 'my', 'new', 'library']

def l(sentence):
    # Lemmatize each token; keep the token unchanged if no lemma is found
    lemmatized_sent = []
    for token in sentence:
        try:
            lemmatized_sent.append(list(getAllLemmas(token, upos='NOUN').values())[0][0])
        except Exception:
            lemmatized_sent.append(token)
    return lemmatized_sent

l(sentence)

['I', 'be', 'test', 'my', 'new', 'library']

# To apply to a DataFrame column of token lists, such as the one in the question:
cleaned['text'].apply(l)
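With roughly 130,000 rows, the same words will be looked up many times, so caching per-word results may help. The following is only a sketch of that idea: `lemmatize_word` is a hypothetical stand-in backed by a toy table, and in practice its body would be replaced by the real per-word lookup.

```python
from functools import lru_cache

# Hypothetical stand-in table; a real setup would call the lemmatizer instead.
_TOY_LEMMAS = {"am": "be", "testing": "test", "producers": "producer"}

@lru_cache(maxsize=None)
def lemmatize_word(word):
    """Return the lemma for one word; repeated words are served from the cache."""
    return _TOY_LEMMAS.get(word, word)

def lemmatize_tokens(tokens):
    """Lemmatize one row, i.e. one list of tokens."""
    return [lemmatize_word(t) for t in tokens]

rows = [
    ["the", "hardworking", "biofuel", "producers"],
    ["the", "hardworking", "biofuel", "producers"],
]
result = [lemmatize_tokens(r) for r in rows]
info = lemmatize_word.cache_info()
print(result[0], info.hits, info.misses)
# ['the', 'hardworking', 'biofuel', 'producer'] 4 4
```

On a pandas column of token lists, the same row function could be passed to apply, for example cleaned['text'].apply(lemmatize_tokens).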

Do read the documentation for LemmInflect. You can do so much more with it.