My dataframe has 2.3M rows. I am trying to get the top 100 most frequent words from it. I do not want punctuation, verbs, digits, or the articles ('a', 'the', 'an'). I am using the following code in Python, but it takes forever to get results. Is there a quicker way to do it?
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
# Download NLTK data if you haven't already
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Comments_Final'])
unique_words = sorted(vectorizer.get_feature_names())
def count_words_without_punctuation_and_verbs(text):
    words = re.findall(r'\b\w+\b', text.lower())
    # Use NLTK to tag words and exclude verbs (VB* tags) and digits (CD tags)
    tagged_words = nltk.pos_tag(words)
    filtered_words = [word for word, pos in tagged_words
                      if not pos.startswith('VB') and not pos == 'CD']
    return len(filtered_words)
# Create a dictionary to store word frequencies
word_frequencies = {}
for word in unique_words:
    count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
    word_frequencies[word] = count
# Sort the words by frequency in descending order
sorted_words = sorted(word_frequencies.items(), key=lambda x: x[1], reverse=True)
# Print the top 100 words
for word, frequency in sorted_words[:100]:
    print(f"{word}: {frequency}")
Yes, there is a faster way, and most of the speedup comes from removing the redundant work in your code.
def count_words_without_punctuation_and_verbs(text):
Note how you call the above function later in a for loop:
for word in unique_words:
    count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
    word_frequencies[word] = count
The call to count_words_without_punctuation_and_verbs() in each iteration means that you are re-tokenizing and re-tagging the entire DataFrame once per unique word, which is hugely inefficient. Worse, the result never depends on word at all, so every entry in word_frequencies ends up with the same number.
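A minimal sketch of the "do it once" restructuring, assuming the same df['Comments_Final'] column of strings: tokenize and tag each comment a single time, and let a collections.Counter accumulate the per-word totals in that same pass.

import re
from collections import Counter

import nltk

word_frequencies = Counter()
for text in df['Comments_Final']:
    # One tokenize + one pos_tag per comment, not one per unique word
    tagged = nltk.pos_tag(re.findall(r'\b\w+\b', text.lower()))
    word_frequencies.update(word for word, pos in tagged
                            if not pos.startswith('VB') and pos != 'CD')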
return len(filtered_words)
This is also redundant. CountVectorizer already computes per-word counts as part of building its document-term matrix, so there is no need to count tokens by hand.
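For instance, summing the columns of the document-term matrix you already build gives every word's corpus-wide count in one vectorized step. A sketch, assuming scikit-learn 1.0+ (where the accessor is spelled get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Comments_Final'])
# Column sums of the sparse matrix = per-word counts across the corpus;
# .A1 flattens the resulting numpy matrix to a plain 1-D array
counts = X.sum(axis=0).A1
word_frequencies = dict(zip(vectorizer.get_feature_names_out(), counts))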
Remember that you don't always need NLTK. For example, str.isdigit() is generally much faster at catching cardinal numbers than tagging every token and checking for CD.
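Putting the pieces together, here is one possible shape for the whole job (a sketch, not a drop-in: it assumes the comments are plain strings and that your stop list is just the three articles): tag the corpus once, drop verbs via the POS tag, drop numbers via str.isdigit(), drop the articles with a set lookup, and take the top 100 straight from the Counter.

import re
from collections import Counter

import nltk

ARTICLES = {'a', 'an', 'the'}

counts = Counter()
for text in df['Comments_Final']:
    words = re.findall(r'\b\w+\b', str(text).lower())
    for word, pos in nltk.pos_tag(words):
        # isdigit() is a cheap string check; no tagger call needed for numbers
        if word.isdigit() or word in ARTICLES or pos.startswith('VB'):
            continue
        counts[word] += 1

for word, frequency in counts.most_common(100):
    print(f"{word}: {frequency}")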