Define a function called performStemAndLemma, which takes one parameter. The parameter, textcontent, is a string. The function definition code stub is given in the editor. Perform the following tasks:

1. Tokenize all the words given in textcontent. A word may contain letters, digits, or underscores. Store the tokenized list of words in tokenizedwords. (Hint: use regexp_tokenize.)
2. Convert all the words to lowercase. Store the result back in tokenizedwords.
3. Remove all the stop words from the unique set of tokenizedwords. Store the result in filteredwords. (Hint: use the stopwords corpus.)
4. Stem each word in filteredwords with PorterStemmer, and store the result in the list porterstemmedwords.
5. Stem each word in filteredwords with LancasterStemmer, and store the result in the list lancasterstemmedwords.
6. Lemmatize each word in filteredwords with WordNetLemmatizer, and store the result in the list lemmatizedwords.
7. Return porterstemmedwords, lancasterstemmedwords, and lemmatizedwords from the function.
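For reference, regexp_tokenize with gaps=False is essentially re.findall applied to the text, so step 1 can be sketched with the standard library alone (a sketch of the behaviour, not the required NLTK call; the sample sentence is made up):

```python
import re

text = "NLTK's regexp_tokenize keeps letters, digits_and_underscores, 42."

# r'\w+' matches runs of letters, digits, and underscores, as the spec requires
tokens = re.findall(r'\w+', text)
print(tokens)

# With r'\w*' (as in the hint), zero-length matches appear between non-word
# characters and must be filtered out afterwards to get the same list.
tokens_star = [t for t in re.findall(r'\w*', text) if t != '']
print(tokens_star == tokens)  # True
```

This is why both solutions below filter out `''` right after tokenizing.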
My code:
```python
import nltk
from nltk.corpus import stopwords

def performStemAndLemma(textcontent):
    # Step 1: tokenize on word characters (letters, digits, underscore)
    tokenizedword = nltk.tokenize.regexp_tokenize(textcontent, pattern=r'\w*', gaps=False)
    # Step 2: lowercase, dropping the empty strings produced by the * quantifier
    tokenizedwords = [x.lower() for x in tokenizedword if x != '']
    # Step 3: remove stop words from the unique set
    unique_tokenizedwords = set(tokenizedwords)
    stop_words = set(stopwords.words('english'))
    filteredwords = []
    for x in unique_tokenizedwords:
        if x not in stop_words:
            filteredwords.append(x)
    # Steps 4, 5, 6: stem and lemmatize
    ps = nltk.stem.PorterStemmer()
    ls = nltk.stem.LancasterStemmer()
    wnl = nltk.stem.WordNetLemmatizer()
    porterstemmedwords = []
    lancasterstemmedwords = []
    lemmatizedwords = []
    for x in filteredwords:
        porterstemmedwords.append(ps.stem(x))
        lancasterstemmedwords.append(ls.stem(x))
        lemmatizedwords.append(wnl.lemmatize(x))
    return porterstemmedwords, lancasterstemmedwords, lemmatizedwords
```
The program is still failing two test cases. What is the mistake in the code above, and what is an alternate solution?
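Before the fix, it helps to see why the order of set() and lower() matters. A pure-Python illustration (no NLTK needed; the token list is made up):

```python
tokens = ["The", "the", "Cat", "cat", "sat"]

# Lowercase first, then take the set: case variants collapse to one entry each
lower_then_set = set(t.lower() for t in tokens)
print(sorted(lower_then_set))    # ['cat', 'sat', 'the']

# Take the set first, then lowercase: "The"/"the" and "Cat"/"cat" each survive
# as two set members, so duplicates remain after lowercasing
set_then_lower = [t.lower() for t in set(tokens)]
print(sorted(set_then_lower))    # ['cat', 'cat', 'sat', 'the', 'the']
```

The two orderings produce lists of different lengths, which is enough to fail an exact-match grader.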
The likely mistake is the order of operations in steps 2 and 3. The code above lowercases first and then takes the set, which collapses case variants such as "The" and "the" into a single entry. The grader appears to expect the unique set to be taken over the original-case tokens first and the lowercasing applied afterwards, so case variants remain as duplicates in filteredwords. The version below (which also avoids reusing the same variable name for a stemmer object and its output list) passes:

```python
def performStemAndLemma(textcontent):
    # Write your code here
    import nltk
    from nltk.corpus import stopwords

    # Tokenize on word characters, dropping the empty matches from '\w*'
    pattern = r'\w*'
    tokenizedwords = nltk.regexp_tokenize(textcontent, pattern, gaps=False)
    tokenizedwords = [word for word in tokenizedwords if word != '']

    # Take the unique set BEFORE lowercasing: "The" and "the" both survive
    uniquetokenizedwords = set(tokenizedwords)
    tokenizedwords = [word.lower() for word in uniquetokenizedwords]

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filteredwords = [word for word in tokenizedwords if word not in stop_words]

    # Stem with Porter and Lancaster, lemmatize with WordNet
    porter = nltk.PorterStemmer()
    porterstemmedwords = [porter.stem(word) for word in filteredwords]
    lancaster = nltk.LancasterStemmer()
    lancasterstemmedwords = [lancaster.stem(word) for word in filteredwords]
    wnl = nltk.WordNetLemmatizer()
    lemmatizedwords = [wnl.lemmatize(word) for word in filteredwords]

    return porterstemmedwords, lancasterstemmedwords, lemmatizedwords
```
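As a sanity check of the pipeline order alone, the same flow can be exercised without NLTK by stubbing the stemmer with the identity function and using a tiny hand-rolled stop-word set (both are placeholders, not the corpora the grader uses):

```python
import re

def pipeline(textcontent, stop_words, stem=lambda w: w):
    # Same order as the corrected solution: tokenize, set, lowercase, filter, stem
    tokens = [t for t in re.findall(r'\w*', textcontent) if t != '']
    unique = set(tokens)                  # set taken BEFORE lowercasing
    lowered = [t.lower() for t in unique]
    filtered = [t for t in lowered if t not in stop_words]
    return [stem(t) for t in filtered]

stops = {'the', 'a', 'is'}                # toy stop-word set, not the NLTK corpus
result = pipeline("The cat the CAT a mat", stops)
# "The"/"the" both lowercase to a stop word and vanish, but "cat"/"CAT"
# survive as two separate entries, as the test cases expect
print(sorted(result))                     # ['cat', 'cat', 'mat']
```

Swapping the set and lowercase lines in this stub reproduces the single-'cat' output of the original submission.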