Tags: nlp, nltk, python-3.7, stemming

Query related to stemming in NLP


I am working on a hands-on task on stemming in NLP using Python.

The task below has to be carried out step by step to get the result.

I have completed steps 1 to 13 and am stuck on steps 14 and 15 (see below).

Please help me understand how to perform steps 14 and 15.

TASK

  1. Import the text corpus brown.

  2. Extract the list of words associated with text collections belonging to the humor genre. Store the result in the variable humor_words.

  3. Convert each word of the list humor_words into lower case, and store the result in lc_humor_words.

  4. Find the list of unique words present in lc_humor_words. Store the result in lc_humor_uniq_words.

  5. Import the corpus words.

  6. Extract the list of words associated with the corpus words. Store the result in the variable wordlist_words.

  7. Find the list of unique words present in wordlist_words. Store the result in wordlist_uniq_words.

  8. Create an instance of PorterStemmer named porter.

  9. Create an instance of LancasterStemmer named lancaster.

  10. Stem each word present in lc_humor_uniq_words with the porter instance, and store the result in the list p_stemmed.

  11. Stem each word present in lc_humor_uniq_words with the lancaster instance, and store the result in the list l_stemmed.

  12. Filter the stemmed words from p_stemmed which are also present in wordlist_uniq_words. Store the result in p_stemmed_in_wordlist.

  13. Filter the stemmed words from l_stemmed which are also present in wordlist_uniq_words. Store the result in l_stemmed_in_wordlist.

  14. Filter the words from lc_humor_uniq_words that have the same length as their corresponding stemmed word in p_stemmed and also contain at least one character that differs from the corresponding stemmed word. Store the result in the list p_stemmed_diff.

  15. Filter the words from lc_humor_uniq_words that have the same length as their corresponding stemmed word in l_stemmed and also contain at least one character that differs from the corresponding stemmed word. Store the result in the list l_stemmed_diff.

  16. Print the number of words present in p_stemmed_diff.

  17. Print the number of words present in l_stemmed_diff.

Below is the code I have completed up to step 13.

    import nltk
    import nltk.corpus

    # Steps 1-4: humor words from the Brown corpus, lower-cased and de-duplicated
    from nltk.corpus import brown
    humor_words = brown.words(categories = 'humor')
    lc_humor_words = [w.lower() for w in humor_words]
    lc_humor_uniq_words = set(lc_humor_words)

    # Steps 5-7: unique entries of the words corpus
    from nltk.corpus import words
    wordlist_words = words.words()
    wordlist_uniq_words = set(wordlist_words)

    # Steps 8-9: stemmer instances
    from nltk.stem import PorterStemmer
    porter = PorterStemmer()
    from nltk.stem import LancasterStemmer
    lancaster = LancasterStemmer()

    # Steps 10-11: stem every unique humor word with each stemmer
    p_stemmed = []
    for word in lc_humor_uniq_words:
        p_stemmed.append(porter.stem(word))
    l_stemmed = []
    for word in lc_humor_uniq_words:
        l_stemmed.append(lancaster.stem(word))

    # Steps 12-13: keep only the stems that appear in the word list
    p_stemmed_in_wordlist = [word1 for word1 in p_stemmed if word1 in wordlist_uniq_words]
    l_stemmed_in_wordlist = [word2 for word2 in l_stemmed if word2 in wordlist_uniq_words]

Solution

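  • Use the code below for steps 14-17. zip pairs each word with its own stem, because p_stemmed and l_stemmed were built by iterating over lc_humor_uniq_words, which has not been modified since, so the iteration order is the same.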

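    # Step 14: same length as the Porter stem, but not identical to it
    p_stemmed_diff = []
    for w1, w2 in zip(lc_humor_uniq_words, p_stemmed):
        if len(w1) == len(w2) and w1 != w2:
            p_stemmed_diff.append(w1)

    # Step 15: same check against the Lancaster stem
    l_stemmed_diff = []
    for w1, w2 in zip(lc_humor_uniq_words, l_stemmed):
        if len(w1) == len(w2) and w1 != w2:
            l_stemmed_diff.append(w1)

    # Steps 16-17: number of words in each result
    print(len(p_stemmed_diff))
    print(len(l_stemmed_diff))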
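
  • If you would rather not depend on two collections staying in the same order, a possible variant is to compute the stem inside the loop. The sketch below only illustrates the "same length, at least one different character" check. The helper name same_length_but_different and the stand-in stemmer toy_stem are hypothetical, made up for this example so that it runs without downloading any corpus. They are not part of NLTK, and toy_stem does not reproduce the real Porter or Lancaster rules.

```python
def same_length_but_different(words, stem):
    """Return the words whose stem has the same length but is not identical."""
    result = []
    for word in words:
        stemmed = stem(word)  # stem computed here, so the word/stem pairing is explicit
        if len(word) == len(stemmed) and word != stemmed:
            result.append(word)
    return result


def toy_stem(word):
    """Hypothetical stand-in stemmer, for illustration only (not an NLTK rule set)."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]        # "running" -> "runn": length changes
    if word.endswith("y"):
        return word[:-1] + "i"  # "happy" -> "happi": same length, one character differs
    return word                 # "cat" -> "cat": unchanged


sample = ["happy", "cat", "running"]
toy_diff = same_length_but_different(sample, toy_stem)
print(toy_diff)  # ['happy']
```

    With the objects from the question, the same helper could be called as same_length_but_different(lc_humor_uniq_words, porter.stem) and same_length_but_different(lc_humor_uniq_words, lancaster.stem). This stems each word a second time, which costs a little extra work but keeps every word next to its own stem.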