Tags: python, pandas, nlp, nltk

Execute nltk.stem.SnowballStemmer in pandas


I have a four-column DataFrame in which two columns hold tokenized words. The tokens have already been lowercased and had stop words removed, and I am now trying to stem them.

I'm not sure whether the apply() method passes the whole Series or each individual cell to the function, or whether I need another way of stepping into each record, so I tried both (I think!).

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

I've tried:

df_2['Headline'] = df_2['Headline'].apply(lambda x: stemmer.stem(item) for item in x)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in ()
----> 1 df_2['Headline__'] = df_2['Headline'].apply(lambda x: stemmer.stem(item) for item in x)

~\AppData\Local\Continuum\anaconda3\envs\learn-env\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3192             else:
   3193                 values = self.astype(object).values
-> 3194                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3195
   3196         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()

TypeError: 'generator' object is not callable

I believe this TypeError is similar to the "'list' object is not callable" error, which I fixed by using the apply() method, but I'm out of ideas for this one. I also tried:

df_2['Headline'] = df_2['Headline'].apply(lambda x: stemmer.stem(x))

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
in ()
----> 1 df_2['Headline'] = df_2['Headline'].apply(lambda x: stemmer.stem(x))
      2
      3 df_2.head()

~\AppData\Local\Continuum\anaconda3\envs\learn-env\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3192             else:
   3193                 values = self.astype(object).values
-> 3194                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3195
   3196         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()

in (x)
----> 1 df_2['Headline'] = df_2['Headline'].apply(lambda x: stemmer.stem(x))
      2
      3 df_2.head()

~\AppData\Local\Continuum\anaconda3\envs\learn-env\lib\site-packages\nltk\stem\snowball.py in stem(self, word)
   1415
   1416         """
-> 1417         word = word.lower()
   1418
   1419         if word in self.stopwords or len(word) <= 2:

AttributeError: 'list' object has no attribute 'lower'


Solution

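  • You need to specify the axis for the apply: calling apply on the DataFrame with axis=1 passes each row to the lambda, and the list comprehension inside the lambda then stems every token in that row's list.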

    Here is a full working example:

    import pandas as pd
    
    df = pd.DataFrame({
        'col_1' : [['ducks'], ['dogs']],
        'col_2' : [['he', 'eats', 'apples'], ['she', 'has', 'cats', 'dogs']],
        'col_3' : ['some data 1', 'some data 2'],
        'col_4' : ['another data 1', 'another data 2']
    })
    df.head()
    

    Output

        col_1   col_2                   col_3       col_4
    0   [ducks] [he, eats, apples]      some data 1 another data 1
    1   [dogs]  [she, has, cats, dogs]  some data 2 another data 2
    

    Now let us apply stemming for the tokenized columns:

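    from nltk.stem import SnowballStemmer

    # Create the English Snowball stemmer once and reuse it for every token
    stemmer = SnowballStemmer('english')

    # axis=1 passes each row to the lambda; the list comprehension stems each token in the cell
    df.col_1 = df.apply(lambda row: [stemmer.stem(item) for item in row.col_1], axis=1)
    df.col_2 = df.apply(lambda row: [stemmer.stem(item) for item in row.col_2], axis=1)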
    

    Check the new content of the dataframe.

    df.head()
    

    Output

        col_1   col_2                   col_3       col_4
    0   [duck]  [he, eat, appl]         some data 1 another data 1
    1   [dog]   [she, has, cat, dog]    some data 2 another data 2
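
As a side note on the two errors in the question, here is a rough sketch of what seems to be going on, plus a variant that calls apply() on the column itself. In the first attempt the lambda is not wrapped in brackets, so Python reads the whole argument as a generator expression and apply() receives a generator instead of a function. In the second attempt apply() does hand each cell to the lambda, but each cell is a list and stemmer.stem() expects a single string. The small df_2 below is a made-up stand-in for the asker's data, and the names not_a_function and stemmed are only for illustration.

```python
import types

import pandas as pd
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# Hypothetical stand-in for the asker's df_2; only the column name comes from the question.
df_2 = pd.DataFrame({'Headline': [['ducks'], ['she', 'has', 'cats', 'dogs']]})

# Without brackets the expression is a generator that yields lambdas,
# so it is not something apply() can call.
not_a_function = (lambda x: stemmer.stem(item) for item in ['ducks'])
print(isinstance(not_a_function, types.GeneratorType), callable(not_a_function))

# With brackets the lambda returns a list of stems for the token list in each cell.
# Series.apply works cell by cell, so no axis argument is involved here.
stemmed = df_2['Headline'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])
print(stemmed.tolist())
```

Assigning the result back with df_2['Headline'] = stemmed should give the same kind of column as the axis=1 version above, assuming every cell in the column really is a list of strings.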