python nltk stemming lemmatization

NLTK-based stemming and lemmatization


I am trying to preprocess a string with a lemmatizer and then remove the punctuation and digits, using the code below. I do not get any error, but the text is not preprocessed correctly: only the stop words are removed. The lemmatizing has no effect, and the punctuation and digits remain.

from nltk.stem import WordNetLemmatizer
import string
import nltk
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data=[]
stop_words = set(nltk.corpus.stopwords.words('english'))
words = nltk.word_tokenize(tweets)
words = [i for i in words if i not in stop_words]
data.append(' '.join(words))
corpus = " ".join(str(x) for x in data)
p = string.punctuation
d = string.digits
table = str.maketrans(p, len(p) * " ")
corpus.translate(table)
table = str.maketrans(d, len(d) * " ")
corpus.translate(table)
print(corpus)

The final output I get is:

This beautiful day16~ . I ; working exercise45.^^^45 text34 .

And expected output should look like:

This beautiful day I work exercise text

Solution

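  • Your current approach does not work because the lemmatizer expects one word at a time. Given the whole string, it looks it up as a single word, finds no match, and returns it unchanged. The punctuation and digits remain for a separate reason: str.translate returns a new string, and your code discards that result instead of assigning it back to corpus.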

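    import re
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    __stop_words = set(nltk.corpus.stopwords.words('english'))

    def clean(tweet):
        # Lowercase, then delete runs of punctuation and digits.
        cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
        # Lemmatize word by word as verbs ('v'), skipping stop words.
        return ' '.join([lemmatizer.lemmatize(i, 'v')
                    for i in cleaned_tweet.split() if i not in __stop_words])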
    

    Alternatively, you can use a PorterStemmer. Stemming strips suffixes by rule instead of looking words up in WordNet, so it needs no part-of-speech tag, but the result is not always a dictionary word.

    from nltk.stem.porter import PorterStemmer  
    stemmer = PorterStemmer() 
    

    Then call the stemmer on each word, in place of lemmatizer.lemmatize(i, 'v'):

    stemmer.stem(i)
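
If you would rather keep the str.translate approach from the question, here is a minimal sketch of the same pipeline with the translate result actually kept and the per-word step passed in as a function, so the lemmatizer and the stemmer are interchangeable. The names strip_punct_digits, clean, normalize and demo are made up for this illustration, and the stop-word set in demo is hand-picked, not NLTK's list.

```python
import string

# Translation table that deletes every ASCII punctuation mark and digit.
_DROP = str.maketrans('', '', string.punctuation + string.digits)


def strip_punct_digits(text):
    # str.translate returns a new string, so the result must be returned
    # (or assigned); calling it and ignoring the result changes nothing.
    return text.translate(_DROP)


def clean(tweet, normalize, stop_words=frozenset()):
    """Lowercase, strip punctuation and digits, drop stop words,
    then apply `normalize` to each remaining word."""
    words = strip_punct_digits(tweet.lower()).split()
    return ' '.join(normalize(w) for w in words if w not in stop_words)


def demo():
    # Assumes NLTK is installed. PorterStemmer is purely rule-based,
    # so it needs no corpus download.
    from nltk.stem.porter import PorterStemmer
    stops = {'this', 'is', 'a', 'i', 'am', 'on', 'an'}  # illustrative only
    tweet = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
    return clean(tweet, PorterStemmer().stem, stops)
```

To lemmatize instead, pass something like lambda w: lemmatizer.lemmatize(w, 'v') as normalize. Note that string.punctuation covers ASCII only, so unlike the regex version this sketch leaves non-ASCII punctuation in place.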