I am trying to identify substrings in a given string using str.contains while mixing OR and AND conditions.
I know that OR can be represented by | inside the pattern:
str.contains("error|break|insufficient")
and that AND can be represented by combining two masks with &:
str.contains("error|break|insufficient") & str.contains("status")
I would like to mix OR and AND together. For example, I want to identify strings that contain "error" OR "break" OR ("insufficient" AND "status").
So a sentence like "error break insufficient" should be identified. With the expression above it is not, because there is no "status" in the sentence.
One approach:
import pandas as pd
# toy data
s = pd.Series(["hello", "world", "simple", "error", "break", "insufficient something status", "status"])
# AND mask: both substrings must be present
insufficient_and_status = s.str.contains("insufficient") & s.str.contains("status")
# OR mask: either substring is enough
break_or_error = s.str.contains("error|break", regex=True)
# combine the two masks with OR
mask = break_or_error | insufficient_and_status
res = s[mask]
print(res)
Output
3 error
4 break
5 insufficient something status
dtype: object
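The same idea can be wrapped in a small function when the word lists change. This is only a sketch: `contains_any_or_all` is a hypothetical helper name, it assumes both lists are non-empty, and it treats every term as a literal substring rather than a regex. The last row of the toy data is added here to cover the sentence from the question.

```python
import re

import pandas as pd


def contains_any_or_all(s, any_of, all_of):
    """Hypothetical helper: True where a string contains any term in
    `any_of`, or every term in `all_of`. Assumes both lists are non-empty."""
    # OR part: escape each term so it is matched literally, then join with |
    any_pattern = "|".join(re.escape(term) for term in any_of)
    any_mask = s.str.contains(any_pattern, regex=True, na=False)
    # AND part: start from all-True and narrow with one literal test per term
    all_mask = pd.Series(True, index=s.index)
    for term in all_of:
        all_mask = all_mask & s.str.contains(term, regex=False, na=False)
    return any_mask | all_mask


# toy data from above, plus the sentence from the question
s = pd.Series(["hello", "world", "simple", "error", "break",
               "insufficient something status", "status",
               "error break insufficient"])
mask = contains_any_or_all(s, ["error", "break"], ["insufficient", "status"])
print(s[mask])
```

Passing na=False makes missing values count as non-matches, so the resulting mask stays boolean and can be used directly for indexing.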
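Alternatively, using a single regex:
mask = s.str.contains("error|break|(?:insufficient.*status|status.*insufficient)", regex=True)
res = s[mask]
print(res)
This alternative relies on the fact that if the string contains both insufficient and status, then at least one of the patterns insufficient.*status or status.*insufficient matches (i.e. either insufficient occurs first or status does). The group is non-capturing, (?:...), because pandas emits a UserWarning when a pattern passed to str.contains has capture groups, and .* rather than .+ lets the two words be directly adjacent.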
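Another way to express the AND part inside a single pattern is with lookaheads. This is a different technique from the alternation above, shown as a minimal sketch: it assumes single-line strings (. does not match newlines by default), and the sample data is made up for illustration. The result is compared against the explicit two-mask approach as a sanity check.

```python
import pandas as pd

# made-up sample data covering both word orders and the single-word cases
s = pd.Series(["error", "break", "insufficient something status",
               "status then insufficient", "insufficient", "status", "hello"])

# reference result: the explicit two-mask approach
expected = s.str.contains("error|break") | (
    s.str.contains("insufficient") & s.str.contains("status")
)

# lookahead variant: each (?=.*word) asserts that the word appears somewhere
# at or after the current position, without consuming any characters
pattern = r"error|break|(?=.*insufficient)(?=.*status)"
mask = s.str.contains(pattern, regex=True)

# compare the single-regex mask with the two-mask reference
print(mask.equals(expected))
print(s[mask])
```

Lookaheads are not capture groups, so this pattern does not trigger the match-groups warning, and adding a third required word only needs one more (?=.*word) instead of listing every ordering.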