Tags: python, nlp, nltk

When to remove stop words when using bigram_measures like PMI?


I need to verify an overall approach for dealing with the stop-word bigrams returned by bigram_measures such as PMI. Why deal with these stop words at all? They're noise and add no value past a certain point.

I've seen several specific examples of how to use bigram_measures. However, I'm wondering WHEN it's best to remove stop words in the overall process of cleaning the data, expanding contractions, lemmatizing/stemming, etc.

And yes, I am using a sufficiently large corpus; I'm aware that corpus size also affects the quality of the bigram_measures results.

Based on the accepted answer in this post (NLTK - Counting Frequency of Bigram), it seems that stop words should only be removed after PMI or other bigram_measures have been applied to the corpus.

"Imagine that if filtering collocations was simply deleting them, then there were many probability measures such as liklihood ratio or the PMI itself (that compute probability of a word relative to other words in a corpus) which would not function properly after deleting words from random positions in the given corpus. By deleting some collocations from the given list of words, many potential functionalities and computations would be disabled..."

Therefore, I believe the best overall process is the following (sketched in code after the list):

  1. Clean the text and remove garbage chars like HTML tags, etc.
  2. Expand contractions (e.g.: they're -> they are)
  3. Lemmatize or stem to normalize the words
  4. Calculate bigrams using bigram_measures like PMI. You can calculate bigrams using other methods, but this is what I'm using.
  5. Apply a frequency filter like "apply_freq_filter(N)" to get the bigrams that occur above your threshold. Note that this will still return some bigrams with stop words mixed in with valuable bigrams.
  6. Check whether BOTH words in a bigram are stop words. If so, exclude that bigram from the final results, but leave the words in the corpus itself for the reasons quoted above.
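Here is a minimal sketch of steps 1-6 using NLTK. The sample text, the CONTRACTIONS mapping, and the MIN_FREQ threshold are illustrative placeholders, not values from the question, and the required NLTK data packages (punkt, wordnet, stopwords) are assumed to be installed.

```python
import re

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "<p>They're running the tests and the tests are passing.</p>"

# 1. Clean out garbage characters such as HTML tags.
text = re.sub(r"<[^>]+>", " ", text).lower()

# 2. Expand contractions (toy mapping; a real pipeline would use a fuller list).
CONTRACTIONS = {"they're": "they are", "don't": "do not"}
for contraction, expansion in CONTRACTIONS.items():
    text = text.replace(contraction, expansion)

# 3. Lemmatize to normalize the words.
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in nltk.word_tokenize(text) if t.isalpha()]

# 4. Score bigrams with PMI -- stop words are still present in the corpus here.
measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# 5. Keep only bigrams at or above a frequency threshold.
MIN_FREQ = 2
finder.apply_freq_filter(MIN_FREQ)

# 6. Drop bigrams in which BOTH words are stop words; the corpus itself is untouched.
stops = set(stopwords.words("english"))
scored = [
    (bigram, score)
    for bigram, score in finder.score_ngrams(measures.pmi)
    if not (bigram[0] in stops and bigram[1] in stops)
]
print(scored)
```

With this ordering, PMI is computed over the intact token stream, and all-stop-word bigrams are only suppressed from the reported results.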

Is this a correct overall approach to dealing with bigram stop words mixed in with valuable bigrams?


Solution

  • One approach is to:

    • clean the text
    • expand contractions
    • lemmatize
    • remove stop words
    • run PMI or another measure to score the n-grams.

    Source: Text Analytics with Python, p. 224.

    I cite the source above to show where this answer comes from, rather than offering something ungrounded. A minimal sketch of this ordering follows.
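For contrast with the question's pipeline, this sketch removes stop words from the token stream before any scoring happens, per the book's ordering. The sample text and the top-N cutoff are placeholders, and the same NLTK data packages as above are assumed.

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "the quick brown fox jumps over the lazy dog and the quick brown fox rests"

lemmatizer = WordNetLemmatizer()
stops = set(stopwords.words("english"))

# Clean/normalize, then drop stop words BEFORE building the finder.
tokens = [
    lemmatizer.lemmatize(t)
    for t in nltk.word_tokenize(text.lower())
    if t.isalpha() and t not in stops
]

finder = BigramCollocationFinder.from_words(tokens)
print(finder.nbest(BigramAssocMeasures.pmi, 5))  # top five bigrams by PMI
```

The trade-off is exactly the one raised in the question: here PMI sees a token stream in which formerly non-adjacent words have become neighbors.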