Tags: python, nltk, token, tokenize, stemming

How to create a function that tokenizes and stems words


My code

def tokenize_and_stem(text):

    tokens = [sent for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(text)]

    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    stems = stemmer.stem(filtered_tokens)

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)

And I'm getting this error:

AttributeError                            Traceback (most recent call last)
in
     13     return stems
     14
---> 15 words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
     16 print(words_stemmed)

in tokenize_and_stem(text)
      9
     10     # Stem the filtered_tokens
---> 11     stems = stemmer.stem(filtered_tokens)
     12
     13     return stems

/usr/local/lib/python3.6/dist-packages/nltk/stem/snowball.py in stem(self, word)
   1415
   1416         """
-> 1417         word = word.lower()
   1418
   1419         if word in self.stopwords or len(word) <= 2:

AttributeError: 'list' object has no attribute 'lower'


Solution

  • YOUR CODE

    def tokenize_and_stem(text):

        tokens = [sent for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(text)]

        filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

        stems = stemmer.stem(filtered_tokens)

    words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
    print(words_stemmed)

    

    The error says that word = word.lower() failed inside stemmer.stem() because a 'list' object has no attribute 'lower'. stem() expects a single word (a string), but you passed it the whole filtered_tokens list.
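
To see why the call fails, here is a minimal sketch that needs no NLTK. `toy_stem` is a hypothetical stand-in, not the real stemmer: it only mimics the first step shown in the traceback, where the argument has `.lower()` called on it.

```python
# Hypothetical stand-in for stemmer.stem(): like the Snowball code in the
# traceback, it begins by calling .lower() on its argument. A real stemmer
# would go on to strip suffixes; this sketch stops after lowercasing.
def toy_stem(word):
    return word.lower()

tokens = ["Today", "is", "his", "wedding"]

# Passing the whole list reproduces the kind of error from the question.
try:
    toy_stem(tokens)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Calling the function once per token works.
stems = [toy_stem(t) for t in tokens]

print(error_message)
print(stems)
```

The same idea applies to the real stemmer: call it once per token inside a list comprehension instead of handing it the list.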

    Passing a list is not the only problem, though. If you fix just the stemming call and keep your own tokens line (line 3) unchanged, you will get no error, but the output will look like this:

    ["today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding."]
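
The repeated sentence comes from the nested comprehension: it keeps `sent` instead of `word`, and it tokenizes the whole `text` rather than the current `sent`. The sketch below shows the effect with two hypothetical stand-ins, `split_sentences` and `split_words`, used only so it runs without NLTK; they are not NLTK functions, and the regex tokenizer may split the text a little differently from `nltk.word_tokenize`.

```python
import re

text = "Today (May 19, 2016) is his only daughter's wedding."

# Hypothetical stand-ins for the NLTK tokenizers (assumptions for this sketch):
def split_sentences(s):
    # Treat the whole input as a single sentence.
    return [s]

def split_words(s):
    # Crude tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", s)

# Original pattern: keeps `sent`, so the sentence is emitted once per word token.
buggy = [sent for sent in split_sentences(text) for word in split_words(text)]

# Fixed pattern: keeps `word`, and tokenizes the current `sent`.
fixed = [word for sent in split_sentences(text) for word in split_words(sent)]

print(all(item == text for item in buggy))  # every element is the full sentence
print(fixed[:4])
```

In the output above, the stemmer has additionally lowercased each copy of the sentence.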

    Here is your fixed code.

    def tokenize_and_stem(text):
    
        tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    
        filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    
        stems = [stemmer.stem(t) for t in filtered_tokens if len(t) > 0]
    
        return stems
    
    words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
    print(words_stemmed)
    

    So I have only changed line 3 (the tokens comprehension) and line 7 (the stemming call) of the function.
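
The filtering line stayed the same in both versions. As a rough illustration of what it keeps, here is a small sketch that applies the same `re.search` test to a hand-written token list. The token list is an assumption about what a word tokenizer returns for the sample sentence, written out so the sketch runs without NLTK.

```python
import re

# Tokens written out by hand for the sample sentence (an assumption, so that
# no tokenizer or NLTK data is needed to run this).
tokens = ["Today", "(", "May", "19", ",", "2016", ")", "is", "his",
          "only", "daughter", "'s", "wedding", "."]

# Same filter as in the function: keep tokens containing at least one ASCII letter.
filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

print(filtered_tokens)
# ['Today', 'May', 'is', 'his', 'only', 'daughter', "'s", 'wedding']
```

Numbers and punctuation are dropped because they contain no letters, so only word-like tokens reach the stemmer.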