Tags: python, python-3.x, nlp, nltk

NLP Stemming and Lemmatization using Regular expression tokenization


Define a function called performStemAndLemma that takes one parameter, textcontent, a string. The function definition code stub is given in the editor. Perform the following tasks:

1. Tokenize all the words given in textcontent. A word consists of letters, digits, or underscores. Store the tokenized list of words in tokenizedwords. (Hint: use regexp_tokenize.)

2. Convert all the words to lowercase. Store the result in the variable tokenizedwords.

3. Remove all the stop words from the unique set of tokenizedwords. Store the result in the variable filteredwords. (Hint: use the stopwords corpus.)

4. Stem each word present in filteredwords with PorterStemmer, and store the result in the list porterstemmedwords.

5. Stem each word present in filteredwords with LancasterStemmer, and store the result in the list lancasterstemmedwords.

6. Lemmatize each word present in filteredwords with WordNetLemmatizer, and store the result in the list lemmatizedwords.

Return porterstemmedwords, lancasterstemmedwords, lemmatizedwords variables from the function.
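For reference, the tokenization in step 1 (words made of letters, digits, or underscores) behaves like a plain `re.findall` over `\w+`; this minimal stand-in for NLTK's regexp_tokenize shows what the step produces, without requiring NLTK:

```python
import re

def tokenize(textcontent):
    # \w+ matches runs of letters, digits, and underscores,
    # mirroring regexp_tokenize(textcontent, r'\w*') once the
    # empty matches from \w* are filtered out.
    return re.findall(r'\w+', textcontent)

print(tokenize("Hello, World! It's test_1."))
# ['Hello', 'World', 'It', 's', 'test_1']
```

Note that punctuation is dropped entirely, so a contraction like "It's" splits into two tokens.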

My code:

import nltk
from nltk.corpus import stopwords
def performStemAndLemma(textcontent):
    # Write your code here
    #Step 1
    tokenizedword = nltk.tokenize.regexp_tokenize(textcontent, pattern=r'\w*', gaps=False)
    #Step 2
    tokenizedwords = [x.lower() for x in tokenizedword if x != '']
    #Step 3
    unique_tokenizedwords = set(tokenizedwords)
    stop_words = set(stopwords.words('english')) 
    filteredwords = []
    for x in unique_tokenizedwords:
        if x not in stop_words:
            filteredwords.append(x)
    #Steps 4, 5, 6
    ps = nltk.stem.PorterStemmer()
    ls = nltk.stem.LancasterStemmer()
    wnl = nltk.stem.WordNetLemmatizer()
    porterstemmedwords =[]
    lancasterstemmedwords = []
    lemmatizedwords = []
    for x in filteredwords:
        porterstemmedwords.append(ps.stem(x))
        lancasterstemmedwords.append(ls.stem(x))
        lemmatizedwords.append(wnl.lemmatize(x))
    return porterstemmedwords, lancasterstemmedwords, lemmatizedwords

The program is still not passing 2 of the test cases. Highlight the mistake in the above code and provide an alternate solution.


Solution

The mistake is the order of operations in steps 2 and 3: your code lowercases the tokens first and then takes the unique set, which merges case variants such as "The" and "the" into a single token. The expected solution builds the unique set from the original-case tokens first and only then lowercases, so both variants survive and the output lists have different lengths. Everything else in your code is equivalent.

    def performStemAndLemma(textcontent):
        # Write your code here
        import nltk
        from nltk.corpus import stopwords

        # Step 1: tokenize on word characters, dropping empty matches
        pattern = r'\w*'
        tokenizedwords = nltk.regexp_tokenize(textcontent, pattern, gaps=False)
        tokenizedwords = [word for word in tokenizedwords if word != '']

        # Step 2: take the unique set FIRST, then lowercase
        uniquetokenizedwords = set(tokenizedwords)
        tokenizedwords = [word.lower() for word in uniquetokenizedwords]

        # Step 3: remove stop words
        stop_words = set(stopwords.words('english'))
        filteredwords = [word for word in tokenizedwords if word not in stop_words]

        # Steps 4, 5, 6: stem and lemmatize
        porter = nltk.PorterStemmer()
        porterstemmedwords = [porter.stem(word) for word in filteredwords]

        lancaster = nltk.LancasterStemmer()
        lancasterstemmedwords = [lancaster.stem(word) for word in filteredwords]

        wnl = nltk.WordNetLemmatizer()
        lemmatizedwords = [wnl.lemmatize(word) for word in filteredwords]

        return porterstemmedwords, lancasterstemmedwords, lemmatizedwords
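To see concretely why the ordering matters, here is a small standalone comparison (plain Python, no NLTK needed). Lowercasing before deduplicating merges "The" and "the" into one token, whereas deduplicating first keeps both, so the two approaches can return lists of different lengths and fail a length-sensitive test case:

```python
tokens = ["The", "cat", "saw", "the", "dog"]

# Question's approach: lowercase first, then dedup
lower_then_set = set(t.lower() for t in tokens)

# Solution's approach: dedup first, then lowercase
set_then_lower = [t.lower() for t in set(tokens)]

print(sorted(lower_then_set))   # ['cat', 'dog', 'saw', 'the']
print(sorted(set_then_lower))   # ['cat', 'dog', 'saw', 'the', 'the']
```

Four tokens versus five: "the" appears twice in the second list because "The" and "the" were distinct members of the set before lowercasing.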