I have a pandas DataFrame with thousands of rows. Each row has a column labeled text_processed that contains text. These texts may be long, hundreds of words per row. Now I want to eliminate words that show up in 95% of the rows. What I am doing is joining all of the text into one large string and tokenizing that string, which gives me a vocabulary of all the words across all of the texts. I now want to get the number of rows each word appears in. A simple (and slow) way is to loop over each word, check whether the word exists in the column, and sum that result to get the number of rows the word is in. This can be seen here:
wordcountPerRow = []
for word in all_words:
    # skip punctuation and a couple of stop words
    if word in [':', '•', 'and', '%', '\\', '|', '-', 'no', 'of', ')', '(', '[', ']', '--', '/', '*', ';', '`', '``', '\'\'', '+']:
        continue
    try:
        # str.contains treats the word as a regex pattern, so words containing
        # regex metacharacters raise and fall into the except branch
        wordcountPerRow.append([word, df_note['text_processed'].str.contains(r'' + word).sum()])
    except:
        print(word)
Once I have all of the sums I will just compute len(df) * 0.95, check whether the number of rows for a word is >= that 95% threshold, and eliminate the word if it is (via a Boolean column). This process seems slow and computationally expensive. Is there a way I can speed this up? Could I use a CountVectorizer?
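For reference, the elimination step I have in mind would look roughly like this (just a sketch; wordcountPerRow is the list of [word, row_count] pairs built above):
threshold = len(df_note) * 0.95
# words that appear in at least 95% of the rows are the ones to eliminate
words_to_eliminate = {word for word, row_count in wordcountPerRow if row_count >= threshold}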
Similar to this: removing words that appear more than x% in a corpus Python
It looks like you can use CountVectorizer with a slight twist. The twist is that because CountVectorizer counts the number of occurrences per document, we can apply a Boolean mask (count_vector > 0): if a word occurs one or more times in a document it is counted as 1, and if it occurs zero times it contributes 0 to the sum. From there we can transpose, set the index to the feature names, and simply select out the percentage interval we want.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()  # max_df=0.8, min_df=0.1 would drop the most/least frequent words directly
count_vector = cv.fit_transform(df_note['text_processed'].tolist())

# number of documents a word occurs in: mask counts to 0/1, then sum over rows
word_document_count = pd.DataFrame(
    np.asarray((count_vector > 0).sum(axis=0)).transpose(),
    index=cv.get_feature_names_out(),  # use get_feature_names() on older scikit-learn
    columns=['Document Count'],
)

top_perc_num = len(df_note) * 0.8
bottom_perc_num = len(df_note) * 0.2
word_document_count_trunc = word_document_count[(word_document_count['Document Count'] < top_perc_num)
                                                & (word_document_count['Document Count'] > bottom_perc_num)]
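As the commented-out arguments above hint, CountVectorizer can also apply the document-frequency cutoffs itself through max_df and min_df, so the filtering happens during fit. A minimal sketch of that variant, using the same 80%/20% thresholds (not necessarily a drop-in replacement for the explicit truncation above):

from sklearn.feature_extraction.text import CountVectorizer

# max_df/min_df given as floats are fractions of documents: keep only words whose
# document frequency lies between 20% and 80% of the rows
cv = CountVectorizer(max_df=0.8, min_df=0.2)
count_vector = cv.fit_transform(df_note['text_processed'].tolist())
kept_words = cv.get_feature_names_out()  # vocabulary after the frequency filtering

The explicit document-count DataFrame is still useful when you want to inspect exactly which words were dropped and at what counts.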
I believe this is a much faster way to accomplish the task. My only gripe is that the numbers were slightly off from the original method (on a small reproducible example the results were identical); the difference is likely because str.contains does substring/regex matching while CountVectorizer only counts whole tokens. This works for 200k+ words in the vocabulary and 90k+ rows.