python, regex, string, substring, contains

Identify substrings using str.contains by mixing AND and OR


I am trying to identify substrings in a given string using str.contains while mixing OR and AND conditions.

I know that OR can be represented by |

str.contains("error|break|insufficient")

and that AND can be represented by & between two str.contains calls

str.contains("error|break|insufficient") & str.contains("status")

I would like to mix OR and AND together. For example, I want to identify strings that contain "error" OR "break" OR ("insufficient" AND "status").

So a sentence like "error break insufficient" should be identified. With the expression above it is not, because there is no "status" in the sentence.
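
A minimal reproduction of the problem, assuming pandas and a made-up two-row Series named `s` (neither is part of the expressions above). It shows that the `&` applies to the whole OR group, so a row containing "error" is still rejected when "status" is missing:

```python
import pandas as pd

# hypothetical two-row example, for illustration only
s = pd.Series(["error break insufficient", "insufficient status"])

# the expression from above: (error OR break OR insufficient) AND status
mask = s.str.contains("error|break|insufficient") & s.str.contains("status")

# the first row is rejected because it lacks "status"
print(mask.tolist())
```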


Solution

  • One approach:

    import pandas as pd
    
    # toy data
    s = pd.Series(["hello", "world", "simple", "error", "break", "insufficient something status", "status"])
    
    # AND mask: both substrings must be present
    insufficient_and_status = s.str.contains("insufficient") & s.str.contains("status")
    
    # OR mask: either substring is enough
    break_or_error = s.str.contains("error|break", regex=True)
    
    # combine the two masks with OR
    mask = break_or_error | insufficient_and_status
    
    res = s[mask]
    print(res)
    

    Output

    3                            error
    4                            break
    5    insufficient something status
    dtype: object
    

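    Alternatively, using a single regex:

    mask = s.str.contains("error|break|(?:insufficient.*status|status.*insufficient)", regex=True)
    
    res = s[mask]
    print(res)
    

    The alternative relies on the fact that if the string contains both insufficient and status, then at least one of the patterns insufficient.*status or status.*insufficient matches (i.e. either insufficient occurs first or status does). Using .* rather than .+ also covers the case where the two words are directly adjacent, and the non-capturing group (?:...) avoids the warning pandas emits when a pattern passed to str.contains has capture groups.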
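
As a quick sanity check, the sketch below compares the two approaches on the same toy data plus two extra rows that are hypothetical additions for illustration (a reversed word order and a missing value). The single-regex pattern here is written with `.*` and a non-capturing group, and `na=False` is passed so that missing values count as non-matches; both are choices made for this sketch.

```python
import pandas as pd

# toy data from the answer, plus two hypothetical edge cases
s = pd.Series([
    "hello", "world", "simple", "error", "break",
    "insufficient something status", "status",
    "status then insufficient",  # reversed order (hypothetical row)
    None,                        # missing value (hypothetical row)
])

# approach 1: combine boolean masks
two_masks = (
    s.str.contains("error|break", regex=True, na=False)
    | (s.str.contains("insufficient", na=False) & s.str.contains("status", na=False))
)

# approach 2: one regex covering both word orders
single_regex = s.str.contains(
    r"error|break|(?:insufficient.*status|status.*insufficient)",
    regex=True,
    na=False,
)

# both approaches should select exactly the same rows
print(two_masks.equals(single_regex))
print(s[two_masks].tolist())
```

The two-mask version is usually easier to extend, since each extra AND condition is one more `str.contains` call rather than another set of orderings in the regex.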