python nltk preprocessor wordnet lemmatization

How to resolve the error: AttributeError: 'generator' object has no attribute 'endswith'

When I run this code to preprocess a text, I get the error below. Someone else had a similar problem, but that post did not have enough details.

I am putting everything in context here, hoping it helps reviewers help us better.

Here is the function:

def preprocessing(text):
    #text=text.decode("utf8")
    #tokenize into words
    tokens=[word for sent in nltk.sent_tokenize(text) for word in 
    nltk.word_tokenize(sent)]
    #remove stopwords
    stop=stopwords.words('english')
    tokens=[token for token in tokens if token not in stop]
    #remove words less than three letters
    tokens=[word for word in tokens if len(word)>=3]
    #lower capitalization
    tokens=[word.lower() for word in tokens]
    #lemmatization
    lmtzr=WordNetLemmatizer()
    tokens=[lmtzr.lemmatize(word for word in tokens)]
    preprocessed_text=' '.join(tokens)
    return preprocessed_text

Calling the function here:

#open the text data from disk location
sms=open('C:/Users/Ray/Documents/BSU/Machine_learning/Natural_language_Processing_Pyhton_And_NLTK_Chap6/smsspamcollection/SMSSpamCollection')
sms_data=[]
sms_labels=[]
csv_reader=csv.reader(sms,delimiter='\t')
for line in csv_reader:
    #adding the sms_id
    sms_labels.append(line[0])
    #adding the cleaned text by calling the preprocessing method
    sms_data.append(preprocessing(line[1]))
sms.close()

Result:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-38-b42d443adaa6> in <module>()
      8     sms_labels.append(line[0])
      9     #adding the cleaned text by calling the preprocessing method
---> 10     sms_data.append(preprocessing(line[1]))
     11 sms.close()

<ipython-input-37-69ef4cd83745> in preprocessing(text)
     12     #lemmatization
     13     lmtzr=WordNetLemmatizer()
---> 14     tokens=[lmtzr.lemmatize(word for word in tokens)]
     15     preprocessed_text=' '.join(tokens)
     16     return preprocessed_text

~\Anaconda3\lib\site-packages\nltk\stem\wordnet.py in lemmatize(self, word, pos)
     38 
     39     def lemmatize(self, word, pos=NOUN):
---> 40         lemmas = wordnet._morphy(word, pos)
     41         return min(lemmas, key=len) if lemmas else word
     42 

~\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py in _morphy(self, form, pos, check_exceptions)
   1798 
   1799         # 1. Apply rules once to the input to get y1, y2, y3, etc.
-> 1800         forms = apply_rules([form])
   1801 
   1802         # 2. Return all that are in the database (and check the original too)

~\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py in apply_rules(forms)
   1777         def apply_rules(forms):
   1778             return [form[:-len(old)] + new
-> 1779                     for form in forms
   1780                     for old, new in substitutions
   1781                     if form.endswith(old)]

~\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py in <listcomp>(.0)
   1779                     for form in forms
   1780                     for old, new in substitutions
-> 1781                     if form.endswith(old)]
   1782 
   1783         def filter_forms(forms):

AttributeError: 'generator' object has no attribute 'endswith'

I believe the error is coming from the source code for nltk.corpus.reader.wordnet

The whole source code can be seen in the nltk documentation page. It's too long to post here; but below is the raw link:

Thanks for your help.

Solution

The error message and traceback point you to the source of the problem:

<ipython-input-37-69ef4cd83745> in preprocessing(text)
     12     #lemmatization
     13     lmtzr=WordNetLemmatizer()
---> 14     tokens=[lmtzr.lemmatize(word for word in tokens)]
     15     preprocessed_text=' '.join(tokens)
     16     return preprocessed_text

~\Anaconda3\lib\site-packages\nltk\stem\wordnet.py in lemmatize(self, word, pos)
     38 
     39     def lemmatize(self, word, pos=NOUN):

Obviously, from the function's signature (word, not words) and the error ("has no attribute 'endswith'" - endswith() is actually a str method), lemmatize() expects a single word, but here:

tokens=[lmtzr.lemmatize(word for word in tokens)]

you are passing a generator expression.

What you want is:

tokens = [lmtzr.lemmatize(word) for word in tokens]
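
To see the difference without installing NLTK, here is a minimal sketch. It uses a hypothetical stand-in, toy_lemmatize, which (like lemmatize()) calls endswith() on its argument. The stand-in and its suffix rule are assumptions made for illustration only, not NLTK's actual logic.

```python
import types


def toy_lemmatize(word):
    """Hypothetical stand-in for WordNetLemmatizer.lemmatize (not NLTK's logic).

    Like the real method, it assumes `word` is a single str.
    """
    # Strip a trailing "s"; this fails if `word` is not a str.
    return word[:-1] if word.endswith("s") else word


tokens = ["cats", "dog", "trees"]

# Parentheses in the wrong place: the whole generator expression is the
# single argument, so the function receives a generator, not a word.
arg = (word for word in tokens)
print(isinstance(arg, types.GeneratorType))  # True

try:
    broken = [toy_lemmatize(word for word in tokens)]
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

# Correct: call the function once per word inside the list comprehension.
fixed = [toy_lemmatize(word) for word in tokens]
print(fixed)  # ['cat', 'dog', 'tree']
```

The only change between the two versions is where the closing parenthesis of the call sits, which is why this mistake is easy to overlook.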

NB: you mention:

I believe the error is coming from the source code for nltk.corpus.reader.wordnet

The error is indeed raised in this package, but it "is coming from" (in the sense of "caused by") your code passing the wrong argument ;)
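
One way to see this distinction is to list the frames of a traceback: the last frame is where the exception was raised, while the earlier frames show the call in your own code that led there. The sketch below uses the standard traceback module; library_function and my_code are hypothetical names standing in for the NLTK internals and the preprocessing function.

```python
import traceback


def library_function(word):
    # Stand-in for library internals: assumes `word` is a str.
    return word.endswith("s")


def my_code(tokens):
    # The bug lives here: a generator is passed instead of a single word.
    return library_function(word for word in tokens)


try:
    my_code(["cats", "dog"])
except AttributeError as exc:
    # extract_tb lists frames from the outermost call to the innermost one.
    frames = traceback.extract_tb(exc.__traceback__)
    frame_names = [frame.name for frame in frames]
    print(frame_names)
```

Reading the list from the end backwards, the first frame that belongs to your own code (my_code here) is usually the best place to start looking.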

I hope this helps you debug this kind of problem by yourself next time.