I want to use simplemma on my dataset. I know how the script works for separate words:
import simplemma
from simplemma import text_lemmatizer
langdata = simplemma.load_data('nl')
text_lemmatizer('word1 word2 word3', langdata)
But how do I change this script to make it work for a complete column, df['Tekst'], in my dataset df? Each row in that column contains multiple words.
I've made the following script:
import simplemma
from simplemma import text_lemmatizer
langdata = simplemma.load_data('nl')
text_lemmatizer(df['Tekst'], langdata)
But I get this error when I run the script:
TypeError: expected string or bytes-like object
What is wrong with my script, and how can I make it work? Thanks!
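text_lemmatizer() expects a single string, but df['Tekst'] is an entire pandas Series, which is what triggers the TypeError. To lemmatize a dataframe column, use .apply() together with word_tokenize() so that each row is processed as its own string, for example: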
import simplemma
from nltk import word_tokenize  # needs NLTK's tokenizer data to be downloaded once
langdata = simplemma.load_data('nl')  # Dutch
# Tokenize each row, lemmatize word by word, then join the lemmas back into one string
dataframe_name['column_name'].apply(lambda x: ' '.join([simplemma.lemmatize(str(word), langdata) for word in word_tokenize(str(x))]))
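The error message in the question is the one Python's re module produces when it is handed something other than a string, which suggests the whole column reached a regex-based tokenizer in one piece. The following is a minimal sketch of that situation using only the standard library. The list `rows` stands in for the dataframe column and `tokenize` is a hypothetical stand-in for a tokenizer; neither is part of simplemma or NLTK.

```python
import re

# A plain list stands in for the DataFrame column (assumption for illustration).
rows = ["de katten lopen", "het huis staat"]

def tokenize(text):
    # Hypothetical stand-in for a regex-based tokenizer.
    return re.findall(r"\w+", text)

try:
    # Passing the whole container at once, like text_lemmatizer(df['Tekst'], ...)
    tokenize(rows)
    failed = False
except TypeError as err:
    failed = True
    print("TypeError:", err)

# Calling the function once per row works; this is what .apply() does for a column.
tokens_per_row = [tokenize(row) for row in rows]
print(tokens_per_row)
```

The per-row call is the same idea as the .apply() line above: the function only ever sees one string at a time.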
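Additionally, punctuation and underscores can be replaced with spaces before tokenizing by changing the word_tokenize() call to the following (this needs import re):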
word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x)))
And lastly, removal of any unwanted stopwords:
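from nltk.corpus import stopwords
dutch_stopwords = set(stopwords.words('dutch'))  # needs the NLTK stopwords corpus to be downloaded
.... str(x))) if word.lower() not in dutch_stopwords])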
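The three steps (punctuation removal, stopword filtering, lemmatization) can also be collected in one small helper instead of a long lambda. This is only a sketch under stated assumptions: `lemmatize_text`, `toy_lemmatize` and `TOY_LEMMAS` are hypothetical names, the toy lookup table replaces simplemma so the example runs without extra packages, and str.split() is used in place of word_tokenize() once punctuation has been removed.

```python
import re

def lemmatize_text(text, lemmatize_word, stopwords=frozenset()):
    """Strip punctuation, split into words, drop stopwords, lemmatize the rest."""
    # Same substitution as above: punctuation and underscores become spaces.
    cleaned = re.sub(r"([^\s\w]|_)+", " ", str(text))
    # str.split() is a simple stand-in for word_tokenize() after cleaning.
    words = cleaned.split()
    return " ".join(
        lemmatize_word(word) for word in words if word.lower() not in stopwords
    )

# Toy lookup table, purely hypothetical; this is not simplemma's language data.
TOY_LEMMAS = {"katten": "kat", "lopen": "lopen", "huizen": "huis"}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the toy table.
    return TOY_LEMMAS.get(word.lower(), word)

rows = ["De katten lopen!", "Huizen, huizen..."]
result = [lemmatize_text(row, toy_lemmatize, stopwords={"de"}) for row in rows]
print(result)
```

With the real libraries, the word-level function would presumably be something like lambda w: simplemma.lemmatize(w, langdata), following the call used earlier in this answer, and the helper would be applied with dataframe_name['column_name'].apply(...). The exact simplemma signature may differ between versions, so check it against the installed release.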