I need to verify an overall approach for dealing with stop-word bigrams that show up in the results of bigram_measures such as PMI. Why deal with these stop words at all? They are noise and add no additional value past a certain point.
I've seen several specific examples of how to use bigram_measures. However, I'm wondering WHEN it's best to remove stop words in the overall process of cleaning the data, expansion, lemmatizing/stemming, etc.
And yes, I am using a sufficiently large corpus; I understand that corpus size also affects the quality of the bigram_measures results.
Based on the accepted answer in this post (NLTK - Counting Frequency of Bigram), it seems that stop words should be removed only after PMI or other bigram_measures have been applied to the corpus:
"Imagine that if filtering collocations was simply deleting them, then there were many probability measures such as liklihood ratio or the PMI itself (that compute probability of a word relative to other words in a corpus) which would not function properly after deleting words from random positions in the given corpus. By deleting some collocations from the given list of words, many potential functionalities and computations would be disabled..."
Therefore, I believe the best process is:

1. Clean the raw data (expansion, lemmatizing/stemming, etc.).
2. Tokenize and run PMI or another bigram_measure over the full corpus, with stop words still in place.
3. Only then filter out bigrams that contain stop words from the scored results (see the sketch below).
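To make step 3 concrete, here is a minimal sketch of what I mean using NLTK's collocation API; the `tokens` list is a placeholder for my own cleaned/lemmatized corpus, and the English stop word list is just an example:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

# Placeholder: a tokenized (already cleaned/lemmatized) corpus.
tokens = ["the", "nobel", "prize", "was", "awarded", "to", "the", "nobel", "laureate"]

bigram_measures = BigramAssocMeasures()

# Score bigrams on the FULL token stream, stop words still present,
# so PMI is computed against unaltered corpus statistics.
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(bigram_measures.pmi)

# Only afterwards drop bigrams that contain a stop word.
stop_words = set(stopwords.words("english"))
keep = [(bigram, score) for bigram, score in scored
        if not any(word in stop_words for word in bigram)]

print(keep)
```

An alternative would be `finder.apply_word_filter(lambda w: w in stop_words)` before scoring, which removes stop-word bigrams from the finder while keeping the full-corpus word counts, but filtering the scored list keeps the "score first, filter second" order explicit.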
Is this a correct overall approach for dealing with stop-word bigrams mixed in with valuable bigrams?
One approach is to:
Source: Text Analytics with Python, p. 224.
I cite the source above to show where this answer comes from, rather than offering something ungrounded.