My aim is to remove rare words from a dataframe of 3 million rows. The code below is taking a very long time. Is there a way I can optimize it?
rare_word = []
for k, v in frequency_word.items():
    if v <= 1:
        rare_word.append(k)

df['description'] = df['description'].apply(lambda x: [i for i in x if i not in rare_word])
Since rare_word is pretty big, the expression i not in rare_word will be slow because it does a linear search over the list for every token. You can speed this up by converting rare_word to a set with rare_word = set(rare_word). Sets not only perform not in in constant time on average thanks to hashing, they also avoid most of the expensive string comparisons (again thanks to hashing). You can use a set comprehension to build the set directly:
# Note the presence of the '{}'
rare_word = {k for k,v in frequency_word.items() if v<=1}
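With that set in place, the filtering step looks the same as before, only the lookup is now O(1) on average. A minimal sketch, assuming each entry of df['description'] is already a list of tokens:

# Same filtering logic, but 'not in' now checks a hash set instead of scanning a list
df['description'] = df['description'].apply(lambda tokens: [w for w in tokens if w not in rare_word])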
It may be possible to optimize the code further, but it is hard to say without more information on the dataframe. At the very least, this optimization should speed up the code by several orders of magnitude.
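As one further tweak (a sketch only, still assuming the column holds lists of tokens), iterating over the column directly sometimes shaves a little overhead compared with apply, since it skips the per-row function-call machinery:

# Plain list comprehension over the column instead of Series.apply
df['description'] = [[w for w in tokens if w not in rare_word] for tokens in df['description']]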