Execute nltk.stem.SnowballStemmer in pandas

I have a four-column DataFrame. Two of the columns contain tokenized words that have been lowercased and had stop words removed, and I am now trying to stem them.

I'm not sure whether apply() on a Series reaches into the individual cells or whether I need another way of stepping into each record, so I tried both (I think!).

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

I've tried:

df_2['Headline'] = df_2['Headline'].apply(lambda x: stemmer.stem(item) for item in x)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in ()
----> 1 df_2['Headline__'] = df_2['Headline'].apply(lambda x: stemmer.stem(item) for item in x)

~\AppData\Local\Continuum\anaconda3\envs\learn-env\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3192         else:
   3193             values = self.astype(object).values
-> 3194             mapped = lib.map_infer(values, f, convert=convert_dtype)
   3195
   3196         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()

TypeError: 'generator' object is not callable

I believe this TypeError is similar to the one that says 'list' object is not callable. I fixed that one with the apply() method, but I'm out of ideas here.

I then tried passing each cell straight to the stemmer:

df_2['Headline'] = df_2['Headline'].apply(lambda x: stemmer.stem(x))

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
in ()
----> 1 df_2['Headline'] = df_2['Headline'].apply(lambda x: stemmer.stem(x))
      2
      3 df_2.head()

~\AppData\Local\Continuum\anaconda3\envs\learn-env\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3192         else:
   3193             values = self.astype(object).values
-> 3194             mapped = lib.map_infer(values, f, convert=convert_dtype)
   3195
   3196         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()

in (x)
----> 1 df_2['Headline'] = df_2['Headline'].apply(lambda x: stemmer.stem(x))
      2
      3 df_2.head()

~\AppData\Local\Continuum\anaconda3\envs\learn-env\lib\site-packages\nltk\stem\snowball.py in stem(self, word)
   1415
   1416         """
-> 1417         word = word.lower()
   1418
   1419         if word in self.stopwords or len(word) <= 2:

AttributeError: 'list' object has no attribute 'lower'

Solution

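Each cell holds a list of tokens, but stemmer.stem() expects a single string, so the stemmer has to be called on every item in the list. Your first attempt is missing the square brackets around the comprehension, so apply() receives a generator expression instead of a function. Your second attempt passes the whole list to stem(), which then fails on word.lower(). One way to fix this is to call apply() on the DataFrame with axis=1, so that the lambda receives one row at a time, and build the stemmed list with a list comprehension.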
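To see why the first attempt in the question raises "'generator' object is not callable", it may help to look at what Python builds when the square brackets are missing. This is only a sketch: fake_stem is a hypothetical stand-in for stemmer.stem() so the snippet runs without NLTK or pandas, and the names not_a_function and stem_tokens are made up for illustration.

```python
import types


def fake_stem(word):
    # Hypothetical stand-in for stemmer.stem(); it only strips a trailing "s".
    return word[:-1] if word.endswith("s") else word


tokens = ["ducks", "dogs"]
x = tokens  # the unbracketed form needs an existing name `x` just to be built

# Without brackets the argument is a generator expression that yields lambdas,
# not a function, so apply() has nothing it can call.
not_a_function = (lambda x: fake_stem(item) for item in x)

# With brackets the argument is a single lambda that returns a list.
stem_tokens = lambda x: [fake_stem(item) for item in x]

is_generator = isinstance(not_a_function, types.GeneratorType)
result = stem_tokens(tokens)
print(is_generator, callable(not_a_function), result)
```

The bracketed lambda is the shape used in the working example: one callable that takes a list and returns a list.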

Here is a full working example:

import pandas as pd

df = pd.DataFrame({
    'col_1' : [['ducks'], ['dogs']],
    'col_2' : [['he', 'eats', 'apples'], ['she', 'has', 'cats', 'dogs']],
    'col_3' : ['some data 1', 'some data 2'],
    'col_4' : ['another data 1', 'another data 2']
})
df.head()

Output

     col_1                   col_2        col_3           col_4
0  [ducks]      [he, eats, apples]  some data 1  another data 1
1   [dogs]  [she, has, cats, dogs]  some data 2  another data 2

Now let us apply stemming for the tokenized columns:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# axis=1 passes one row at a time to the lambda; stem every token in that row's list
df.col_1 = df.apply(lambda row: [stemmer.stem(item) for item in row.col_1], axis=1)
df.col_2 = df.apply(lambda row: [stemmer.stem(item) for item in row.col_2], axis=1)

Check the new content of the dataframe.

df.head()

Output

    col_1                 col_2        col_3           col_4
0  [duck]       [he, eat, appl]  some data 1  another data 1
1   [dog]  [she, has, cat, dog]  some data 2  another data 2
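Since each column is stemmed on its own, calling apply() directly on the Series should work as well, which is closer to what the question attempted. The following is a minimal sketch assuming the same sample data as above (only the two tokenized columns are recreated). The helper name stem_tokens is introduced here for illustration and is not part of NLTK or pandas.

```python
import pandas as pd
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Same tokenized sample columns as in the example above.
df = pd.DataFrame({
    "col_1": [["ducks"], ["dogs"]],
    "col_2": [["he", "eats", "apples"], ["she", "has", "cats", "dogs"]],
})


def stem_tokens(tokens):
    """Stem every token in one cell's list and return a new list."""
    return [stemmer.stem(token) for token in tokens]


# Series.apply passes one cell (a list of tokens) at a time, so no axis is needed.
for column in ["col_1", "col_2"]:
    df[column] = df[column].apply(stem_tokens)

print(df)
```

For the question's DataFrame this would correspond to df_2['Headline'].apply(stem_tokens), assuming every cell in that column is a list of strings.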