Tags: python, nlp, stemming

Stemming on tokenized words


I have this dataset, where each row is already a list of tokens:

>cleaned['text']
0         [we, have, a, month, open, #postdoc, position,...
1         [the, hardworking, biofuel, producers, in, iow...
2         [the, hardworking, biofuel, producers, in, iow...
3         [in, today, s, time, it, is, imperative, to, r...
4         [special, thanks, to, gaetanos, beach, club, o...
                                ...                        
130736    [demand, gw, sources, fossil, fuels, renewable...
130737         [there, s, just, not, enough, to, go, round]
130738    [the, answer, to, deforestation, lies, in, space]
130739    [d, filament, from, plastic, waste, regrind, o...
130740          [gb, grid, is, generating, gw, out, of, gw]
Name: text, Length: 130741, dtype: object

Is there a simple way to stem all the words?


Solution

  • There may be better answers, but I personally find the LemmInflect library the best option for lemmatization and inflection.

    # pip install lemminflect
    from lemminflect import getLemma, getInflection, getAllLemmas
    
    word = 'testing'
    # getAllLemmas returns a dict keyed by POS; take the first tuple of lemmas
    lemma = list(getAllLemmas(word, upos='NOUN').values())[0]
    # Inflect the lemma to the past tense (Penn Treebank tag VBD)
    inflect = getInflection(lemma[0], tag='VBD')
    
    print(word, lemma, inflect)
    
    testing ('test',) ('tested',)
    

    I would avoid stemming because it is not very useful if you want to work with language models, or with any text classification that depends on context. Stemming and lemmatization both reduce inflected words to a root form. The difference is that a stem might not be an actual word, whereas a lemma is always a real dictionary word. For example, the Porter stemmer reduces 'studies' to 'studi', while its lemma is 'study'.

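    Inflection is the reverse of lemmatization: it turns a lemma into a particular form, as getInflection does above when it maps 'test' to 'tested' with the tag VBD.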


    sentence = ['I', 'am', 'testing', 'my', 'new', 'library']
    
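    def lemmatize(tokens):
        """Return the first lemma of each token, or the token itself if none is found."""
        lemmatized = []
        for token in tokens:
            try:
                lemmatized.append(list(getAllLemmas(token, upos='NOUN').values())[0][0])
            except IndexError:  # no lemma returned for this token
                lemmatized.append(token)
        return lemmatized
    
    lemmatize(sentence)
    
    ['I', 'be', 'test', 'my', 'new', 'library']
    
    # To apply this to a DataFrame column of token lists:
    df['sentences'].apply(lemmatize)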
    

    Do read the documentation for LemmInflect. You can do so much more with it.
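
If you do want plain stemming, as the question asks, something along these lines should work. This is only a minimal sketch using NLTK's PorterStemmer, and it assumes NLTK and pandas are installed. The small Series is a made-up stand-in for cleaned['text'], and stem_tokens is a helper name introduced here for illustration.

```python
import pandas as pd
from nltk.stem import PorterStemmer  # rule-based, so no corpus download is needed

# Hypothetical stand-in for cleaned['text']: each row is already a list of tokens.
text = pd.Series([
    ["the", "hardworking", "biofuel", "producers"],
    ["the", "answer", "to", "deforestation", "lies", "in", "space"],
])

stemmer = PorterStemmer()

def stem_tokens(tokens):
    """Return the Porter stem of every token in one row."""
    return [stemmer.stem(token) for token in tokens]

# apply() calls stem_tokens once per row and returns a new Series of lists.
stemmed = text.apply(stem_tokens)
print(stemmed.tolist())
```

On the real data this would be cleaned['text'].apply(stem_tokens). Note that 'producers' comes out as 'produc', which shows the point made above: a stem is not necessarily a real word.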