python, pandas, string-operations, boolean-indexing

How to speed up pandas boolean indexing with multiple string conditions


I have a 73-million-row dataset, and I need to filter out rows that match any of a few conditions. I have been doing this with boolean indexing, but it's taking a really long time (~30 minutes) and I want to know whether I can make it faster (e.g. fancy indexing, np.where, np.compress?).

My code:

clean_df = df[~(df.project_name.isin(p_to_drop) | 
                df.workspace_name.isin(ws_to_drop) | 
                df.campaign_name.str.contains(regex_string,regex=True) | 
                df.campaign_name.isin(small_launches))]

The regex string is

regex_string = ('(?i)^.*ARCHIVE.*$|^.*birthday.*$|^.*bundle.*$|^.*Competition followups.*$|^.*consent.*$|^.*DOI.*$|'
                '^.*experiment.*$|^.*hello.*$|^.*new subscribers.*$|^.*not purchased.*$|^.*parent.*$|'
                '^.*re engagement.*$|^.*reengagement.*$|^.*re-engagement.*$|^.*resend.*$|^.*Resend of.*$|'
                '^.*reward.*$|^.*survey.*$|^.*test.*$|^.*thank.*$|^.*welcome.*$')

and the other three conditions are lists of strings, each with fewer than 50 items.


Solution

  • With this many rows, it will likely be faster to remove the records one condition at a time. Regex matching is typically slow, so leave it for the last step, when it runs against a much smaller data frame.

    For example:

    # Apply the cheap membership filters first; each step shrinks the frame.
    # Note: build each mask from the current clean_df, not the original df,
    # otherwise the boolean Series no longer aligns with the filtered frame.
    clean_df = df.loc[~df.project_name.isin(p_to_drop)]
    clean_df = clean_df.loc[~clean_df.workspace_name.isin(ws_to_drop)]
    clean_df = clean_df.loc[~clean_df.campaign_name.isin(small_launches)]
    # Run the slow regex last, on what remains.
    clean_df = clean_df.loc[~clean_df.campaign_name.str.contains(regex_string, regex=True)]
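A further speed-up is possible because every alternative in the pattern has the form `^.*word.*$`, which is just a substring match: the whole regex can be flattened to a plain alternation of keywords passed to `str.contains` with `case=False` (replacing the `(?i)` flag). A minimal sketch on a toy frame; the sample data and the shortened keyword list below are invented for illustration:

```python
import pandas as pd

# Toy stand-in for the 73M-row frame; all values are made up.
df = pd.DataFrame({
    "project_name": ["alpha", "beta", "gamma", "delta"],
    "workspace_name": ["ws1", "ws2", "ws3", "ws4"],
    "campaign_name": ["Spring sale", "Birthday offer", "Summer promo", "Big bundle"],
})

p_to_drop = ["beta"]
ws_to_drop = ["ws4"]
small_launches = ["Spring sale"]

# Each '^.*word.*$' alternative is equivalent to "contains word", so a flat
# alternation plus case=False does the same job with less regex work.
keywords = ["archive", "birthday", "bundle", "welcome"]  # truncated for the sketch
regex_string = "|".join(keywords)

# Cheap membership checks first, slow regex last, on the shrunken frame.
clean_df = df.loc[~df.project_name.isin(p_to_drop)]
clean_df = clean_df.loc[~clean_df.workspace_name.isin(ws_to_drop)]
clean_df = clean_df.loc[~clean_df.campaign_name.isin(small_launches)]
clean_df = clean_df.loc[~clean_df.campaign_name.str.contains(regex_string, case=False, regex=True)]

print(clean_df.campaign_name.tolist())  # only "Summer promo" survives
```

The same filter order applies to the real frame: the membership checks discard most rows cheaply, so the regex only has to scan what is left.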