Tags: python, python-3.x, lambda, nltk, stemming

Create list from list with function in pandas dataframe


I would like to create a new pandas column by running a word-stemming function over a list of words in another column. I can tokenize a single string using apply and a lambda, but I cannot figure out how to extend this to a column in which each cell holds a list of words.

import nltk
import pandas as pd
test = {'Statement' : ['congratulations on the future','call the mechanic','more text'], 'Other' : [2,3,4]}
df = pd.DataFrame(test)
df['tokenized'] = df.apply(lambda row: nltk.word_tokenize(row['Statement']), axis=1)

I know I could solve it with a nested for loop, but that seems inefficient and results in a SettingWithCopyWarning:

df['stems'] = ''
for x in range(len(df)):
    print(len(df['tokenized'][x]))
    df['stems'][x] = row_stems=[]
    for y in range(len(df['tokenized'][x])):
        print(df['tokenized'][x][y])
        row_stems.append(stemmer.stem(df['tokenized'][x][y]))

Isn't there a better way to do this?

EDIT:

Here's an example of what the result should look like:

    Other     Statement                       tokenized                             stems 
0   2         congratulations on the future   [congratulations, on, the, future]    [congratul, on, the, futur]
1   3         call the mechanic               [call, the, mechanic]                 [call, the, mechan]
2   4         more text                       [more, text]                          [more, text]

Solution

  • There is indeed no need for an explicit loop. A list comprehension inside apply will work just fine.

    Assuming ps is a Porter stemmer instance (for example, NLTK's PorterStemmer()):

    df['stems'] = df['tokenized'].apply(lambda words: 
                                        [ps.stem(word) for word in words])
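
For completeness, here is a minimal end-to-end sketch of the same idea. It assumes that ps is NLTK's PorterStemmer, which the answer does not state explicitly, and it uses str.split() as a stand-in for nltk.word_tokenize so that it runs without downloading any tokenizer data. For these simple sentences the two produce the same tokens.

```python
import pandas as pd
from nltk.stem import PorterStemmer

# Assumption: the answer's `ps` is an NLTK PorterStemmer instance.
ps = PorterStemmer()

df = pd.DataFrame({
    'Statement': ['congratulations on the future', 'call the mechanic', 'more text'],
    'Other': [2, 3, 4],
})

# Stand-in for nltk.word_tokenize: split each statement on whitespace.
df['tokenized'] = df['Statement'].str.split()

# One list comprehension per row. Assigning the whole column in a single
# statement avoids the chained indexing (df['stems'][x] = ...) that
# triggers SettingWithCopyWarning in the loop version.
df['stems'] = df['tokenized'].apply(lambda words: [ps.stem(word) for word in words])

print(df[['tokenized', 'stems']])
```

The stems column should then match the expected result shown in the question, for example [congratul, on, the, futur] for the first row.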