
How to prevent catastrophic backtracking with multiple negative lookaheads


I have a dataframe with strings in one column. I would like to append the words 'section 22' to a string when it contains the phrase 'personal information'. At the same time, the 'section 22' add-on should not happen if the string contains any of the following: s. 26, s. 29, s. 33, s. 22, or s. 32. Here is my dataframe:

import re
import pandas as pd
df = pd.DataFrame({
    'Order': ['Order90-098','OrderF14-47', 'OrderF13-43', 'Order56-090', 'Order90-098', 'Order78-897'],
    'Ruling': ['foo','personal information', 's. 26 personal information', 'personal information s. 33', 'personal information s. 67', 'personal s. 32 information']})

Hoped for result:

df = pd.DataFrame({
    'Order': ['Order90-098','OrderF14-47', 'OrderF13-43', 'Order56-090', 'Order90-098', 'Order78-897'],
    'Ruling': ['foo','personal information section 22', 's. 26 personal information', 'personal information s. 33', 'personal information s. 67', 'personal s. 32 information']})

What I've figured out: I can add section 22 to a string if the string contains 'personal information', and I can also abort the operation if it contains the number 26.

df['Ruling'] = df['Ruling'].apply(lambda x: re.sub(r'^(?!.*26).*(personal information.*$)',r"\1 section 22", x, flags=re.I))

When I try to expand on the above solution by adding multiple negative lookaheads, I get a catastrophic backtracking error:

df['Ruling'] = df['Ruling'].apply(lambda x: re.sub(r'^(?!.*29).*(?!.*32).*(?!.*33).*(?!.*22).*(.*(?!.*26).*personal information.*$)',r"\1 section 22", x, flags=re.I))

When I try to use alternation inside a single lookahead instead, 'personal information' still matches even when one of the numbers is present in the string:

df['Ruling'] = df['Ruling'].apply(lambda x: re.sub(r'^(?!29|32|33|22|26.*)(.*personal information.*$)',r"\1 section 22", x, flags=re.I))
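One likely reason the alternation attempt above still matches: the lookahead sits directly after `^` with no leading `.*`, so it only checks whether the string starts with one of the numbers. A minimal sketch of a single lookahead that scans the whole string is shown below. The name `PATTERN` is made up for illustration, and the sketch assumes the excluded references always appear in the form "s. NN". Because it follows the written rule strictly, the 's. 67' row gains the suffix, which differs from the hoped-for result listed earlier.

```python
import re
import pandas as pd

# Hypothetical name. One lookahead with one .* replaces the chain of
# separate lookaheads, each of which carried its own .* to backtrack over.
PATTERN = re.compile(
    r'^(?!.*\bs\. (?:22|26|29|32|33)\b)(.*personal information.*)$',
    flags=re.I,
)

df = pd.DataFrame({
    'Ruling': ['foo', 'personal information', 's. 26 personal information',
               'personal information s. 33', 'personal information s. 67',
               'personal s. 32 information']})

# Rows that fail the lookahead, or lack the phrase, are returned unchanged.
df['Ruling'] = df['Ruling'].apply(lambda x: PATTERN.sub(r'\1 section 22', x))
print(df['Ruling'].tolist())
```

The escaped `\.` and the `\b` word boundaries keep the pattern from matching unrelated text such as "s. 221".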

I've thought about using any() but don't know how it would work with re.sub.
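On the `any` idea: one way it could work is to drop `re.sub` entirely and use plain substring checks inside a helper function. This is only a sketch. `EXCLUDED` and `add_section_22` are hypothetical names, and the check is a simple case-insensitive substring test, so a string containing "s. 221" would also count as excluded.

```python
import pandas as pd

# Excluded section references, taken from the list in the question.
EXCLUDED = ('s. 22', 's. 26', 's. 29', 's. 32', 's. 33')

def add_section_22(text):
    """Append ' section 22' when the phrase is present and no excluded reference is."""
    lowered = text.lower()
    has_phrase = 'personal information' in lowered
    has_excluded = any(ref in lowered for ref in EXCLUDED)
    if has_phrase and not has_excluded:
        return text + ' section 22'
    return text

rulings = pd.Series(['foo', 'personal information', 's. 26 personal information',
                     'personal information s. 33', 'personal information s. 67',
                     'personal s. 32 information'])
result = rulings.apply(add_section_22).tolist()
print(result)
```

No regex is involved here, so backtracking is not a concern. As with any strict reading of the rule, the 's. 67' row receives the suffix.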

Thanks in advance for your help.


Solution

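  • You could use Series.mask with two str.contains conditions instead of a single regex substitution:

    df['Ruling'] = (df['Ruling']
                    .mask((~df['Ruling'].str.contains(r"s. [22|26|29|32|33]", regex=True)) &
                          (df['Ruling'].str.contains('personal information')), df['Ruling'] + ' section 22'))

    Note that [22|26|29|32|33] is a character class rather than an alternation, and the unescaped . matches any character, so the first condition matches "s", any character, a space, and then any one of the characters 2, 3, 6, 9 or |. That is why the 's. 67' row is also left untouched, in line with the hoped-for result above. With the sample data, this gives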

             Order                           Ruling
    0  Order90-098                              foo
    1  OrderF14-47  personal information section 22
    2  OrderF13-43       s. 26 personal information
    3  Order56-090       personal information s. 33
    4  Order90-098       personal information s. 67
    5  Order78-897       personal s. 32 information
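For comparison, here is a sketch of the same mask approach with the section numbers written as a true alternation, assuming the intent is to exclude exactly s. 22, s. 26, s. 29, s. 32 and s. 33. The variable names `has_excluded` and `has_phrase` are illustrative. Under this stricter reading the 's. 67' row does receive the suffix, unlike in the output shown above.

```python
import pandas as pd

df = pd.DataFrame({
    'Order': ['Order90-098', 'OrderF14-47', 'OrderF13-43',
              'Order56-090', 'Order90-098', 'Order78-897'],
    'Ruling': ['foo', 'personal information', 's. 26 personal information',
               'personal information s. 33', 'personal information s. 67',
               'personal s. 32 information']})

# Escaped dot plus a non-capturing alternation: matches only the listed sections.
has_excluded = df['Ruling'].str.contains(r's\. (?:22|26|29|32|33)\b', regex=True)
# case=False mirrors the re.I flag used in the question's attempts.
has_phrase = df['Ruling'].str.contains('personal information', case=False)

# mask replaces values where the condition is True and keeps the rest.
df['Ruling'] = df['Ruling'].mask(has_phrase & ~has_excluded,
                                 df['Ruling'] + ' section 22')
print(df)
```

Building the two boolean Series separately keeps each condition readable and makes it easy to inspect which rows were excluded.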