Tags: python, nltk, apply, snowball

Passing value in column as parameter in apply with nltk snowball stemmer


Passing df['language'] works for the stopwords removal but not for the Snowball stemmer. Is there a way I can get around that?

I haven't really found any clues so far...

import nltk
from nltk.corpus import stopwords
import pandas as pd
import re

df = pd.DataFrame([['A sentence in English', 'english'], ['En mening på svenska', 'swedish']], columns = ['text', 'language'])

def tokenize(text):
    tokens = re.split(r'\W+', text)
    return tokens

def remove_stopwords(tokenized_list, language):
    stopword = nltk.corpus.stopwords.words(language)
    text = [word for word in tokenized_list if word not in stopword]
    return text

def stemming(tokenized_text, l):
    ss = nltk.stem.SnowballStemmer(l)
    text = [ss.stem(word) for word in tokenized_text]
    return text

df['text_tokenized'] = df['text'].apply(lambda x: tokenize(x.lower()))
df['text_nostop'] = df['text_tokenized'].apply(lambda x: remove_stopwords(x, df['language']))
df['text_stemmed'] = df['text_nostop'].apply(lambda x: stemming(x, df['language']))

I expected it to do Snowball stemming using english and swedish as the language, in the same way as the stopwords removal does. Instead I get the error message below:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().


Solution

  • Try this instead:

    df['text_stemmed'] = df.apply(lambda x: stemming(x['text_nostop'], x['language']), axis=1)
    

    Edit: when you use apply on a specific column, as in df['text_tokenized'].apply(lambda x: ...), the lambda receives x, the value of text_tokenized for each row, whereas df['language'] is not tied to a particular row; it is the entire pandas Series.

    That is, in lambda x: remove_stopwords(x, df['language']), the value of df['language'] is not the 'language' entry of the corresponding row but a pandas Series containing both 'english' and 'swedish':

    0    english
    1    swedish
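
    That Series is also where the ValueError comes from: SnowballStemmer expects a single language name, and (as an assumption about its internals) it validates that name with a membership/equality check. Comparing a whole Series to one string yields a boolean Series, and forcing that into a single True/False is exactly the "truth value of a Series is ambiguous" error. A minimal sketch of that failure mode:

    import pandas as pd

    languages = pd.Series(['english', 'swedish'])

    # Comparing the whole column to one language gives a boolean Series ...
    mask = (languages == 'english')

    # ... and asking for its single truth value is ambiguous:
    try:
        bool(mask)
    except ValueError as err:
        print(err)  # The truth value of a Series is ambiguous. Use a.empty, ...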
    

    So your second apply call should be changed in the same way:

    df['text_nostop'] = df.apply(lambda x: remove_stopwords(x['text_tokenized'], x['language']), axis=1)
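
    For reference, here is the whole pipeline with both calls switched to row-wise apply (axis=1). It is just your original code with those two lines swapped, and it assumes the NLTK stopwords corpus is available (nltk.download('stopwords')):

    import re

    import nltk
    import pandas as pd

    df = pd.DataFrame(
        [['A sentence in English', 'english'], ['En mening på svenska', 'swedish']],
        columns=['text', 'language'],
    )

    def tokenize(text):
        return re.split(r'\W+', text)

    def remove_stopwords(tokenized_list, language):
        stopword = nltk.corpus.stopwords.words(language)
        return [word for word in tokenized_list if word not in stopword]

    def stemming(tokenized_text, language):
        ss = nltk.stem.SnowballStemmer(language)
        return [ss.stem(word) for word in tokenized_text]

    df['text_tokenized'] = df['text'].apply(lambda x: tokenize(x.lower()))
    # axis=1 hands each row to the lambda, so x['language'] is that row's string
    df['text_nostop'] = df.apply(
        lambda x: remove_stopwords(x['text_tokenized'], x['language']), axis=1)
    df['text_stemmed'] = df.apply(
        lambda x: stemming(x['text_nostop'], x['language']), axis=1)

    print(df[['language', 'text_stemmed']])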