Tags: python, pandas, stemming

Stemming a pandas dataframe


I have a tweet dataset (taken from NLTK) that is currently in a pandas dataframe, and I need to stem it. I have tried many different approaches and get different errors, such as

AttributeError: 'Series' object has no attribute 'lower'
and
KeyError: 'text'

I don't understand the KeyError, as the column is definitely called 'text'. However, I understand that I need to convert the dataframe to a string for the stemmer to work (I think).

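Here is an example of my code

from nltk.corpus import twitter_samples
from nltk.stem.snowball import SnowballStemmer
from pandas import DataFrame

stemmer = SnowballStemmer("english")

negative_tweets = twitter_samples.strings('negative_tweets.json')

negtweetsdf = DataFrame(negative_tweets,columns=['text'])

# raises AttributeError: 'Series' object has no attribute 'lower'
print(stemmer.stem(negtweetsdf['text']))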

Solution

  • You need to apply the stemming function to the 'text' column (a series) as follows

    negtweetsdf['text'].apply(stemmer.stem)
    

    This returns a new series; assign it back to a column if you want to keep the result.

    Functions that expect a single string value will not work when passed a pandas dataframe or series directly. They have to be called on each element in turn, which is what .apply does on a series.

    Here is a worked example with lists inside a dataframe column. The stemmer treats its input as a single word, so each tweet is split into tokens first.

    import pandas as pd
    from nltk.stem.snowball import SnowballStemmer
    from nltk.tokenize import TweetTokenizer

    stemmer = SnowballStemmer("english")
    tokenizer = TweetTokenizer()

    df = pd.DataFrame([['some extremely exciting tweet'],['another']], columns=['tweets'])

    # split each tweet string into a list of tokens
    df['tweets'] = df['tweets'].apply(tokenizer.tokenize)

    # for each row (apply), stem every token in the list
    # and return a list containing the stems
    df['stemmed'] = df['tweets'].apply(lambda x: [stemmer.stem(y) for y in x])
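To tie this back to the question's own dataframe, here is a rough sketch that reproduces both errors and then stems the 'text' column. The two sample tweets are made-up stand-ins for twitter_samples.strings('negative_tweets.json'), and stem_tweet is a hypothetical helper name. The KeyError reproduction is only a guess at the cause, since the failing code was not posted: calling DataFrame.apply without axis=1 passes whole columns to the function, so looking up 'text' inside it fails.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import TweetTokenizer

stemmer = SnowballStemmer("english")
tokenizer = TweetTokenizer()

# Hypothetical stand-in for twitter_samples.strings('negative_tweets.json')
negative_tweets = ["I am running late again", "missing my friends"]
negtweetsdf = pd.DataFrame(negative_tweets, columns=["text"])


def stem_tweet(tweet):
    """Tokenize one tweet string and stem each token (hypothetical helper)."""
    return [stemmer.stem(token) for token in tokenizer.tokenize(tweet)]


# Passing the whole Series fails: stem() calls .lower() on its argument.
try:
    stemmer.stem(negtweetsdf["text"])
except AttributeError as err:
    series_error = str(err)

# Assumed cause of the KeyError: without axis=1 the lambda receives a whole
# column, whose index holds row labels (0, 1, ...), so 'text' is not found.
try:
    negtweetsdf.apply(lambda row: row["text"])
except KeyError as err:
    key_error = repr(err)

# Selecting the column first applies the helper to one string at a time.
negtweetsdf["stems"] = negtweetsdf["text"].apply(stem_tweet)
print(negtweetsdf)
```

Selecting the column with negtweetsdf["text"] before calling .apply keeps the function signature simple (one string in, one list out). If you do need access to several columns per row, negtweetsdf.apply(func, axis=1) would be the variant to reach for.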