nlp, nltk, python-3.7

Calculate number of filtered Bigrams


I am working through hands-on NLP problems and got stuck on the task given below.

The steps have to be executed in sequence.

I have completed them, but the Fresco platform does not accept my solution.

Please let me know what I did wrong in the code below.

TASK

Import the text corpus brown, then:

  1. Extract the list of words associated with text collections belonging to the news genre. Store the result in the variable news_words.

  2. Convert each word of the list news_words into lower case, and store the result in lc_news_words.

  3. Compute bigrams of the list lc_news_words, and store it in the variable lc_news_bigrams.

  4. From lc_news_bigrams, filter bigrams where both words contain only alphabet characters. Store the result in lc_news_alpha_bigrams.

  5. Extract the list of words associated with the corpus stopwords. Store the result in stop_words.

  6. Convert each word of the list stop_words into lower case, and store the result in lc_stop_words.

  7. Filter only the bigrams from lc_news_alpha_bigrams where the words are not part of lc_stop_words. Store the result in lc_news_alpha_nonstop_bigrams.

  8. Print the total number of filtered bigrams.

Below is the code I have written so far, but the Fresco platform does not accept its output.

import nltk
import nltk.corpus
from nltk.corpus import brown
from nltk.util import bigrams
from nltk.corpus import stopwords

news_words = brown.words(categories='news')
lc_news_words = [w.lower() for w in news_words]
lc_news_bigrams = list(nltk.bigrams(lc_news_words))
lc_news_alpha_bigrams = [(word1, word2) for word1, word2 in lc_news_bigrams if (word1.isalpha() and word2.isalpha())]

stop_words = stopwords.words('english')
lc_stop_words = [w.lower() for w in stop_words]
lc_news_alpha_nonstop_bigrams = [(w1, w2) for w1, w2 in lc_news_alpha_bigrams if (w1.lower() not in lc_stop_words and w2.lower() not in lc_stop_words)]

len(lc_news_alpha_nonstop_bigrams)

Solution

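  • Your code is correct apart from one detail: remove the argument 'english' from the stopwords call. Change

    stop_words = stopwords.words('english')

    to

    stop_words = stopwords.words()

    and it will work. Without an argument, words() returns the stop words of every language in the corpus rather than only the English ones, which matches the wording of step 5.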
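
For reference, here is a minimal sketch of the whole task with that change applied. It is not the platform's reference solution. The function names count_filtered_bigrams and count_news_bigrams and the language parameter are hypothetical, and two choices are mine rather than the asker's: the stop words are held in a set so that lookups are fast, and zip is used to build the adjacent pairs, which gives the same pairs as nltk.bigrams. The second function assumes NLTK is installed and the brown and stopwords corpora have been downloaded.

```python
def count_filtered_bigrams(words, stop_words):
    """Count adjacent word pairs that are purely alphabetic and stop-word free."""
    lc_words = [w.lower() for w in words]
    # A set makes each membership test O(1) instead of scanning a list.
    lc_stop_words = {w.lower() for w in stop_words}
    # zip(seq, seq[1:]) yields the same adjacent pairs as nltk.bigrams(seq).
    pairs = zip(lc_words, lc_words[1:])
    return sum(
        1
        for w1, w2 in pairs
        if w1.isalpha() and w2.isalpha()
        and w1 not in lc_stop_words and w2 not in lc_stop_words
    )


def count_news_bigrams(language=None):
    """Run the task on the Brown news genre (hypothetical helper).

    Assumes nltk.download('brown') and nltk.download('stopwords')
    have already been run.
    """
    # Imported here so the helper above stays usable without NLTK.
    from nltk.corpus import brown, stopwords

    news_words = brown.words(categories="news")
    # No argument: stop words of every language in the corpus.
    # With e.g. 'english': only that language's list.
    stop_words = stopwords.words(language) if language else stopwords.words()
    return count_filtered_bigrams(news_words, stop_words)
```

Calling print(count_news_bigrams()) gives the count for the all-languages stop list, and print(count_news_bigrams('english')) gives the count for the original code, so the two can be compared. Step 8 asks for the number to be printed, so when the code runs as a script a bare len(...) expression shows nothing and an explicit print is needed.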