Tags: python, stemming

Use Simplemma on whole dataset in Python


I want to use simplemma on my dataset. I know how the script works for separate words:

import simplemma
from simplemma import text_lemmatizer

langdata = simplemma.load_data('nl')
text_lemmatizer('word1 word2 word3', langdata)

But how do I change this script so that it works on a complete column ['Tekst'] of my dataframe df? Each row in that column contains multiple words.

I've made the following script:

import simplemma
from simplemma import text_lemmatizer

langdata = simplemma.load_data('nl')
text_lemmatizer(df['Tekst'], langdata)

But I get this error when I run the script:

TypeError: expected string or bytes-like object

What is wrong in my script and how can I make it work? Thanks!


Solution

  • text_lemmatizer() expects a single string, but df['Tekst'] is a pandas Series, which is why you get the TypeError. Use the .apply() method together with word_tokenize() to lemmatize the column row by row, such as:

    import simplemma
    from nltk import word_tokenize
    langdata = simplemma.load_data('nl')   # Dutch

    df['Tekst'].apply(lambda x: ' '.join(
        simplemma.lemmatize(word, langdata) for word in word_tokenize(str(x))))
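
    The .apply() pattern above can be sketched end-to-end with a toy lookup table standing in for simplemma, so the example runs without language data; the words and lemmas below are made up for illustration only, and in practice you would call simplemma.lemmatize(word, langdata) instead:

    ```python
    import pandas as pd

    # Toy lookup standing in for simplemma's lemmatizer -- an assumption
    # for illustration; real lemmas come from simplemma and its langdata.
    TOY_LEMMAS = {'woorden': 'woord', 'huizen': 'huis', 'liep': 'lopen'}

    def toy_lemmatize(word):
        # Fall back to the word itself when it is not in the lookup table.
        return TOY_LEMMAS.get(word.lower(), word)

    df = pd.DataFrame({'Tekst': ['woorden huizen', 'hij liep snel']})

    # Same pattern as the answer: split each row into words, lemmatize
    # each word, and join the results back into one string per row.
    df['Lemmas'] = df['Tekst'].apply(
        lambda x: ' '.join(toy_lemmatize(w) for w in str(x).split()))

    print(df['Lemmas'].tolist())  # ['woord huis', 'hij lopen snel']
    ```

    The sketch uses str.split() instead of word_tokenize() purely to stay self-contained; the .apply() structure is identical.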
    

    Additionally, punctuation can be stripped before tokenizing by substituting non-word characters with spaces (this requires import re):

    import re
    word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x)))
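
    To see what that substitution does on its own, here is a quick check with a made-up sample sentence:

    ```python
    import re

    text = "Hallo, wereld! Dit is een test_zin."
    # ([^\s\w]|_)+ matches runs of punctuation; the |_ alternative also
    # catches underscores, which \w would otherwise count as word characters.
    cleaned = re.sub(r'([^\s\w]|_)+', ' ', text)

    print(cleaned.split())  # ['Hallo', 'wereld', 'Dit', 'is', 'een', 'test', 'zin']
    ```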
    

    And lastly, removal of any unwanted stopwords:

    from nltk.corpus import stopwords
    dutch_stopwords = set(stopwords.words('dutch'))

    df['Tekst'].apply(lambda x: ' '.join(
        simplemma.lemmatize(word, langdata)
        for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x)))
        if word.lower() not in dutch_stopwords))
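
    The stopword-filtering step can be sketched in isolation with a small hardcoded stopword set, so it runs without downloading the NLTK corpus; the handful of Dutch stopwords below is an assumption for illustration, and in practice you would use set(stopwords.words('dutch')):

    ```python
    import re

    # A few Dutch stopwords, hardcoded so the sketch needs no NLTK download;
    # in practice use set(stopwords.words('dutch')) from nltk.corpus.
    DUTCH_STOPWORDS = {'de', 'het', 'een', 'is', 'en', 'van'}

    def clean_row(text):
        # Strip punctuation, split into words, then drop stopwords
        # (compared case-insensitively, as in the answer above).
        words = re.sub(r'([^\s\w]|_)+', ' ', str(text)).split()
        return ' '.join(w for w in words if w.lower() not in DUTCH_STOPWORDS)

    print(clean_row("Dit is een zin, met de woorden!"))  # Dit zin met woorden
    ```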