
Faster way to get unique word frequencies with NLTK


My dataframe has 2.3M rows. I am trying to get the top 100 most frequent words from it. I do not want punctuation, verbs, digits, or articles ('a', 'the', 'an'). I am using the following code in Python, but it takes forever to produce results. Is there a quicker way to do it?

    import re
    import nltk
    from sklearn.feature_extraction.text import CountVectorizer

    # Download NLTK data if you haven't already
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df['Comments_Final'])
    # On scikit-learn >= 1.0 this is get_feature_names_out()
    unique_words = sorted(vectorizer.get_feature_names())

    def count_words_without_punctuation_and_verbs(text):
        words = re.findall(r'\b\w+\b', text.lower())
        # Use NLTK to tag words and exclude verbs (VB* tags) and digits (CD tags)
        tagged_words = nltk.pos_tag(words)
        filtered_words = [word for word, pos in tagged_words
                          if not pos.startswith('VB') and not pos == 'CD']
        return len(filtered_words)

    # Create a dictionary to store word frequencies
    word_frequencies = {}
    for word in unique_words:
        count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
        word_frequencies[word] = count

    # Sort the words by frequency in descending order
    sorted_words = sorted(word_frequencies.items(), key=lambda x: x[1], reverse=True)

    # Print the top 100 words
    for word, frequency in sorted_words[:100]:
        print(f"{word}: {frequency}")
    

Solution

  • Yes, there is a faster way. Clean up your code a bit and several redundancies stand out:

    1. def count_words_without_punctuation_and_verbs(text)

    Note how you call the above function later in a for loop:

    for word in unique_words:
       count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
       word_frequencies[word] = count
    

    The call to count_words_without_punctuation_and_verbs() in each iteration means that you are redundantly tokenizing and tagging the entire DataFrame once per unique word, which is why it is so slow. Worse, the result does not depend on word at all, so every entry in word_frequencies ends up with the same count. Tokenize and tag each comment exactly once and accumulate counts as you go; see the sketch below.
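    Here is a minimal sketch of that single-pass approach (collections.Counter and the ARTICLES set are my additions; the set drops 'a', 'an', 'the' as the question requires, and the tagger data is assumed downloaded as in the question):

        from collections import Counter
        import re
        import nltk

        ARTICLES = {'a', 'an', 'the'}
        counter = Counter()

        # Tokenize and tag each comment exactly once
        for text in df['Comments_Final']:
            words = re.findall(r'\b\w+\b', text.lower())
            for word, pos in nltk.pos_tag(words):
                # Skip verbs, cardinal numbers, and articles
                if pos.startswith('VB') or pos == 'CD' or word in ARTICLES:
                    continue
                counter[word] += 1

        # Counter yields the top 100 directly, no manual sorting needed
        for word, frequency in counter.most_common(100):
            print(f"{word}: {frequency}")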

    2. return len(filtered_words)

    This is also redundant. CountVectorizer already computes per-word counts while building the document-term matrix, so you can read the frequencies straight off that matrix instead of re-counting; see the sketch below.
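    A short sketch of that idea (assuming scikit-learn >= 1.0 for get_feature_names_out(); the column sums of the sparse document-term matrix are the per-word totals):

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer

        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(df['Comments_Final'])

        # Column sums of the sparse matrix = total count of each word
        counts = np.asarray(X.sum(axis=0)).ravel()
        words = vectorizer.get_feature_names_out()

        # Pair words with counts and keep the 100 most frequent
        top100 = sorted(zip(words, counts), key=lambda t: t[1], reverse=True)[:100]

    Note that this counts every token; you would still need the POS filter from step 1 if verbs must be excluded.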

    3. Mini-sidebar

    Remember that you don't always need to use NLTK. For example, a plain str.isdigit() check is generally much faster than POS-tagging a word just to see whether it comes back as CD (cardinal number).
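    A tiny illustration (the word list is made up):

        words = ['price', '2023', 'rose', '7']

        # Pure string check, no tagger involved
        non_digits = [w for w in words if not w.isdigit()]
        # -> ['price', 'rose']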