My dataframe has 2.3M rows. I am trying to get the top 100 most frequent words from it. I do not want punctuation, verbs, digits, or the articles ('a', 'the', 'an'). I am using the following code in Python, but it takes forever to get results. Is there a quicker way to do it?
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
# Download NLTK data if you haven't already
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Comments_Final'])
unique_words = sorted(vectorizer.get_feature_names())
def count_words_without_punctuation_and_verbs(text):
    words = re.findall(r'\b\w+\b', text.lower())
    # Use NLTK to tag words and exclude verbs (VB* tags) and digits (CD tags)
    tagged_words = nltk.pos_tag(words)
    filtered_words = [word for word, pos in tagged_words
                      if not pos.startswith('VB') and not pos == 'CD']
    return len(filtered_words)
# Create a dictionary to store word frequencies
word_frequencies = {}
for word in unique_words:
    count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
    word_frequencies[word] = count
# Sort the words by frequency in descending order
sorted_words = sorted(word_frequencies.items(), key=lambda x: x[1], reverse=True)
# Print the top 100 words
for word, frequency in sorted_words[:100]:
    print(f"{word}: {frequency}")
Yes, there is a faster way, and most of the speedup comes from removing the redundant work in your code.
def count_words_without_punctuation_and_verbs(text):
Note how you call the above function later in a for loop:
for word in unique_words:
    count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
    word_frequencies[word] = count
The call to count_words_without_punctuation_and_verbs() in each iteration means that you are re-tokenizing and re-tagging the entire DataFrame once per unique word, which is hugely inefficient. Worse, the result never depends on word at all, so every entry in word_frequencies ends up with the same number.
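A minimal sketch of the "do it once" restructuring, assuming the same df['Comments_Final'] column of strings: tokenize and tag each comment a single time, and let a collections.Counter accumulate the per-word totals in that same pass.

import re
from collections import Counter

import nltk

word_frequencies = Counter()
for text in df['Comments_Final']:
    # One tokenize + one pos_tag per comment, not one per unique word
    tagged = nltk.pos_tag(re.findall(r'\b\w+\b', text.lower()))
    word_frequencies.update(word for word, pos in tagged
                            if not pos.startswith('VB') and pos != 'CD')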
return len(filtered_words)
This is also redundant. CountVectorizer already computes per-word counts as part of building its document-term matrix, so there is no need to count tokens by hand.
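For instance, summing the columns of the document-term matrix you already build gives every word's corpus-wide count in one vectorized step. A sketch, assuming scikit-learn 1.0+ (where the accessor is spelled get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Comments_Final'])
# Column sums of the sparse matrix = per-word counts across the corpus;
# .A1 flattens the resulting numpy matrix to a plain 1-D array
counts = X.sum(axis=0).A1
word_frequencies = dict(zip(vectorizer.get_feature_names_out(), counts))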
Remember that you don't always need NLTK. For example, str.isdigit() is generally much faster at catching cardinal numbers than tagging every token and checking for CD.
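Putting the pieces together, here is one possible shape for the whole job (a sketch, not a drop-in: it assumes the comments are plain strings and that your stop list is just the three articles): tag the corpus once, drop verbs via the POS tag, drop numbers via str.isdigit(), drop the articles with a set lookup, and take the top 100 straight from the Counter.

import re
from collections import Counter

import nltk

ARTICLES = {'a', 'an', 'the'}

counts = Counter()
for text in df['Comments_Final']:
    words = re.findall(r'\b\w+\b', str(text).lower())
    for word, pos in nltk.pos_tag(words):
        # isdigit() is a cheap string check; no tagger call needed for numbers
        if word.isdigit() or word in ARTICLES or pos.startswith('VB'):
            continue
        counts[word] += 1

for word, frequency in counts.most_common(100):
    print(f"{word}: {frequency}")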