My aim is to remove rare words from a dataframe of 3 million rows. The code below is taking a very long time. Is there a way I can optimize it?
rare_word = []
for k, v in frequency_word.items():
    if v <= 1:
        rare_word.append(k)

df['description'] = df['description'].apply(lambda x: [i for i in x if i not in rare_word])
Since rare_word is pretty big, the expression i not in rare_word will be slow because it does a linear search over the list for every token. You can speed this up by converting rare_word to a set with rare_word = set(rare_word). Sets not only perform not in in constant time on average thanks to hashing, they also avoid most of the expensive string comparisons (again thanks to hashing). You can use a set comprehension to build the set directly:
# Note the presence of the '{}'
rare_word = {k for k,v in frequency_word.items() if v<=1}
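With that set in place, the filtering step looks the same as before, only the lookup is now O(1) on average. A minimal sketch, assuming each entry of df['description'] is already a list of tokens:

# Same filtering logic, but 'not in' now checks a hash set instead of scanning a list
df['description'] = df['description'].apply(lambda tokens: [w for w in tokens if w not in rare_word])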
It may be possible to optimize the code further, but it is hard to say without more information on the dataframe. At the very least, this optimization should speed up the code by several orders of magnitude.
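As one further tweak (a sketch only, still assuming the column holds lists of tokens), iterating over the column directly sometimes shaves a little overhead compared with apply, since it skips the per-row function-call machinery:

# Plain list comprehension over the column instead of Series.apply
df['description'] = [[w for w in tokens if w not in rare_word] for tokens in df['description']]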