python pandas regex

How do I create a regex dynamically using strings in a list for use in a pandas dataframe search?


The following code successfully identifies the 2nd and 3rd texts, and only those texts, in a pandas dataframe by searching for rows that contain the word "cod" or "i":

import numpy as np
import pandas as pd
texts_df = pd.DataFrame({"id":[1,2,3,4],
                      "text":["she loves coding", 
                              "he was eating cod",
                              "i do not like fish",
                              "fishing is not for me"]})

texts_df.loc[texts_df["text"].str.contains(r'\b(cod|i)\b', regex=True)]


I would like to build the list of words dynamically by inserting words from a long list, but I can't figure out how to do that successfully.

I've tried the following, but I get an error saying "r is not defined". I expected that, since r is not a variable, but I can't make it part of the string either, and I don't know what to do instead:

kw_list = ["cod", "i"]

kw_regex_string = "\b("
for kw in kw_list:
  kw_regex_string = kw_regex_string + kw + "|"
kw_regex_string = kw_regex_string[:-1]  # remove the final "|" at the end
kw_regex_string = kw_regex_string + ")\b"

myregex = r + kw_regex_string
texts_df.loc[texts_df["text"].str.contains(myregex, regex=True)]

How can I build the 'or' condition from the list of keywords and then insert it into the regex in a way that will work in the pandas dataframe search?


Solution

  • When I do this, I map re.escape over the list to escape any characters that have a special meaning in a regex, join the results with | as the separator, and insert them between the parentheses of a non-capturing group using string formatting:

    import re
    
    kw_list = ['cod', 'i']
    
    my_regex = r'\b(?:%s)\b' % '|'.join(map(re.escape, kw_list))
    
    texts_df.loc[texts_df['text'].str.contains(my_regex, regex=True)]
    

    Variant using an f-string:

    my_regex = fr'\b(?:{"|".join(map(re.escape, kw_list))})\b'
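Either form can be wrapped in a small function when the keyword list changes often. The sketch below is only an illustration: the name build_keyword_regex and the empty-list check are assumptions, not part of the answer. It checks the pattern against the question's four texts with plain re.search, without needing pandas.

```python
import re

def build_keyword_regex(keywords):
    """Return a pattern matching any keyword as a whole word (hypothetical helper)."""
    if not keywords:
        # An empty alternation would match at every word boundary.
        raise ValueError("keywords must not be empty")
    alternation = "|".join(re.escape(kw) for kw in keywords)
    return rf"\b(?:{alternation})\b"

texts = [
    "she loves coding",
    "he was eating cod",
    "i do not like fish",
    "fishing is not for me",
]

pattern = build_keyword_regex(["cod", "i"])
matches = [bool(re.search(pattern, text)) for text in texts]

print(pattern)  # \b(?:cod|i)\b
print(matches)  # [False, True, True, False]
```

The returned string can be passed to texts_df['text'].str.contains(pattern, regex=True) exactly as in the code above. Using a non-capturing group (?:...) should also avoid the warning that pandas emits when a str.contains pattern has capture groups.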
    

    Crafted regex: '\\b(?:cod|i)\\b'
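The doubled backslashes above are just how Python displays a backslash in a string's repr. The r prefix is part of the literal's syntax rather than a value that can be concatenated, which is why r + kw_regex_string in the question raises a NameError. A minimal sketch of the difference between the three spellings:

```python
plain = "\b"     # one character: backspace (\x08)
raw = r"\b"      # two characters: a backslash followed by "b"
escaped = "\\b"  # the same two characters, written without the r prefix

print(len(plain), len(raw), len(escaped))  # 1 2 2
print(raw == escaped)                      # True
print(repr(raw))                           # '\\b'

# Building the pattern by concatenation works once the backslashes are real.
kw_list = ["cod", "i"]
built = "\\b(?:" + "|".join(kw_list) + ")\\b"
print(built == r"\b(?:cod|i)\b")           # True
```

So the loop in the question would work if "\b(" and ")\b" were written as r"\b(" and r")\b", although it would still leave special characters in the keywords unescaped.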

    Example of escaping special characters:

    kw_list = ['10.00$', '*word*', '(A)']
    
    # crafted regex
    '\\b(?:10\\.00\\$|\\*word\\*|\\(A\\))\\b'
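To see what the escaping changes, the following sketch compares an escaped keyword with an unescaped one using plain re.search. The sample strings are made up for illustration. It also shows one caveat worth keeping in mind: \b only matches next to a word character, so keywords that begin or end with punctuation may not match when wrapped in \b...\b.

```python
import re

kw_list = ["10.00$", "*word*", "(A)"]
escaped = "|".join(map(re.escape, kw_list))
print(escaped)  # 10\.00\$|\*word\*|\(A\)

# Escaped: "." and "$" are literal, so only the exact text matches.
print(bool(re.search(re.escape("10.00$"), "price 10.00$")))  # True
print(bool(re.search(re.escape("10.00$"), "price 10x00$")))  # False

# Unescaped: "." matches any character and "$" anchors the end of the string.
print(bool(re.search("10.00$", "price 10x00")))  # True

# Caveat: the trailing \b needs a word character on one side, so a keyword
# ending in "$" followed by a space does not satisfy it.
bounded = rf"\b(?:{escaped})\b"
print(bool(re.search(bounded, "price 10.00$ today")))  # False
```

If keywords like these need to be matched, the boundary checks would have to be adapted, for example with lookarounds; that is outside what the answer above covers.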