Tags: python, stemming

Use Simplemma on whole dataset in Python


I want to use simplemma on my dataset. I know how the script works for separate words:

import simplemma
from simplemma import text_lemmatizer

langdata = simplemma.load_data('nl')
text_lemmatizer('word1 word2 word3', langdata)

But how do I change this script so that it works on a complete column ['Tekst'] of my dataframe df? Each row in that column contains multiple words.

I've made the following script:

import simplemma
from simplemma import text_lemmatizer

langdata = simplemma.load_data('nl')
text_lemmatizer(df['Tekst'], langdata)

But I get this error when I run the script:

TypeError: expected string or bytes-like object

What is wrong in my script and how can I make it work? Thanks!


Solution

  • text_lemmatizer() expects a single string, but df['Tekst'] is a pandas Series, which is why you get the TypeError. Use the .apply() method together with word_tokenize() to lemmatize the column row by row, such as:

    import simplemma
    from nltk import word_tokenize
    langdata = simplemma.load_data('nl')   # Dutch

    df['Tekst'].apply(lambda x: ' '.join(
        simplemma.lemmatize(word, langdata) for word in word_tokenize(str(x))))
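
    The .apply() pattern above can be sketched end-to-end with a toy lookup table standing in for simplemma, so the example runs without language data; the words and lemmas below are made up for illustration only, and in practice you would call simplemma.lemmatize(word, langdata) instead:

    ```python
    import pandas as pd

    # Toy lookup standing in for simplemma's lemmatizer -- an assumption
    # for illustration; real lemmas come from simplemma and its langdata.
    TOY_LEMMAS = {'woorden': 'woord', 'huizen': 'huis', 'liep': 'lopen'}

    def toy_lemmatize(word):
        # Fall back to the word itself when it is not in the lookup table.
        return TOY_LEMMAS.get(word.lower(), word)

    df = pd.DataFrame({'Tekst': ['woorden huizen', 'hij liep snel']})

    # Same pattern as the answer: split each row into words, lemmatize
    # each word, and join the results back into one string per row.
    df['Lemmas'] = df['Tekst'].apply(
        lambda x: ' '.join(toy_lemmatize(w) for w in str(x).split()))

    print(df['Lemmas'].tolist())  # ['woord huis', 'hij lopen snel']
    ```

    The sketch uses str.split() instead of word_tokenize() purely to stay self-contained; the .apply() structure is identical.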
    

    Additionally, punctuation can be stripped before tokenizing by substituting non-word characters with spaces (this requires import re):

    import re
    word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x)))
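
    To see what that substitution does on its own, here is a quick check with a made-up sample sentence:

    ```python
    import re

    text = "Hallo, wereld! Dit is een test_zin."
    # ([^\s\w]|_)+ matches runs of punctuation; the |_ alternative also
    # catches underscores, which \w would otherwise count as word characters.
    cleaned = re.sub(r'([^\s\w]|_)+', ' ', text)

    print(cleaned.split())  # ['Hallo', 'wereld', 'Dit', 'is', 'een', 'test', 'zin']
    ```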
    

    And lastly, removal of any unwanted stopwords:

    from nltk.corpus import stopwords
    dutch_stopwords = set(stopwords.words('dutch'))

    df['Tekst'].apply(lambda x: ' '.join(
        simplemma.lemmatize(word, langdata)
        for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x)))
        if word.lower() not in dutch_stopwords))
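
    The stopword-filtering step can be sketched in isolation with a small hardcoded stopword set, so it runs without downloading the NLTK corpus; the handful of Dutch stopwords below is an assumption for illustration, and in practice you would use set(stopwords.words('dutch')):

    ```python
    import re

    # A few Dutch stopwords, hardcoded so the sketch needs no NLTK download;
    # in practice use set(stopwords.words('dutch')) from nltk.corpus.
    DUTCH_STOPWORDS = {'de', 'het', 'een', 'is', 'en', 'van'}

    def clean_row(text):
        # Strip punctuation, split into words, then drop stopwords
        # (compared case-insensitively, as in the answer above).
        words = re.sub(r'([^\s\w]|_)+', ' ', str(text)).split()
        return ' '.join(w for w in words if w.lower() not in DUTCH_STOPWORDS)

    print(clean_row("Dit is een zin, met de woorden!"))  # Dit zin met woorden
    ```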