python nltk stemming lemmatization textblob

Why is my output returned in a strip-format that cannot be lemmatized/stemmed in Python?


The first step is tokenizing the text from a dataframe using NLTK. Then I apply spelling correction using TextBlob; for this, I convert the output from tuple to string. After that, I need to lemmatize/stem the tokens (using NLTK). The problem is that my output is returned in a strip-format, so it cannot be lemmatized/stemmed.

#create a dataframe
import functools
import operator
import pandas as pd
import nltk
from textblob import TextBlob
df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening","studying"]})

#tokenization
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
def tokenize(text):
    return [w for w in w_tokenizer.tokenize(text)]
df["text2"] = df["text"].apply(tokenize)

#spelling correction
def spell_eng(text):
    text=TextBlob(str(text)).correct()
    #convert from tuple to str
    text=functools.reduce(operator.add, (text))
    return text
df['text3'] = df['text2'].apply(spell_eng)


#lemmatization/stemming
def stem_eng(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w,'v') for w in text]
df['text4'] = df['text3'].apply(stem_eng)

Generated output: (image not available)

Desired output:

text4
--------------
[spell]
[be]
[work,cook,listen]
[study]

Solution

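  • The problem comes from the spell_eng step: str(text) turns the whole token list into a single string, so the dataframe stores a string instead of a list of tokens. stem_eng then iterates over that string character by character, which is why the lemmatization is not working.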

    I have written a solution, which is a slight modification of your code.

    import pandas as pd
    import nltk
    from textblob import TextBlob
    import functools
    import operator
    
    df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening","studying"]})
    
    #tokenization
    w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
    def tokenize(text):
        return [w for w in w_tokenizer.tokenize(text)]
    df["text2"] = df["text"].apply(tokenize)
    
    
    # spelling correction
    def spell_eng(text):
        # correct each token on its own instead of the whole list at once
        text = [TextBlob(str(w)).correct() for w in text] #CHANGE
        # join each corrected TextBlob back into a plain str
        text = [functools.reduce(operator.add, (w)) for w in text] #CHANGE
        return text
    
    df['text3'] = df['text2'].apply(spell_eng)
    
    
    # lemmatization/stemming
    def stem_eng(text):
        lemmatizer = nltk.stem.WordNetLemmatizer()
        return [lemmatizer.lemmatize(w,'v') for w in text] 
    df['text4'] = df['text3'].apply(stem_eng)
    df['text4']
    

    Hope these things help.
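
To see why the original spell_eng output could not be lemmatized, here is a minimal sketch in plain Python, with no NLTK or TextBlob involved. The `tokens` list is only a stand-in for one row of df["text2"], and the variable names are made up for illustration.

```python
# Stand-in for one row of df["text2"]: a list of tokens.
tokens = ["working", "cooking", "listening"]

# What the original spell_eng did first: str() on the whole list
# produces the list's printed representation as one string.
as_string = str(tokens)
print(as_string)  # ['working', 'cooking', 'listening']

# Iterating over a string yields single characters, so a later
# per-item step (such as lemmatization) sees characters, not words.
items_from_string = [item for item in as_string]
print(items_from_string[:4])  # ['[', "'", 'w', 'o']

# Handling each token separately keeps one string per word.
items_per_token = [str(token) for token in tokens]
print(items_per_token)  # ['working', 'cooking', 'listening']
```

This is the reason the modified spell_eng loops over the tokens: each word stays a separate list element, so the lemmatizer receives whole words.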