I have the following series:
series = pd.Series(['ABC', 'AAABBB', 'BBBCCSSS', 'AABBCC'])
and I would like to check if each string contains A, B and C. Such as:
x = (series.str.contains('A')) & (series.str.contains('B')) & (series.str.contains('C'))
However, I would like to be able to adjust the contents of the contains() function with a list like items = ['A', 'B', 'C'] or items = ['D', 'E', 'F', 'G'], etc., which would build the above variable x.
How could I create the variable x iteratively using the list?
This is not the most efficient approach, so if you're comparing large amounts of data it may not be a workable solution, but it is possible to build a single regex.
series = pd.Series(['ABC', 'AAABBB', 'BBBCCSSS', 'AABBCC'])
items = ['A', 'B', 'C']
A way to say AND in regex terms is to use multiple lookahead assertions, (?=...), one per item.
You can use re.escape to escape metacharacters (e.g. ., +, etc.) in each item:
''.join(f'(?=.*{re.escape(item)})' for item in items)
(?=.*A)(?=.*B)(?=.*C)
import re
pattern = ''.join(f'(?=.*{re.escape(item)})' for item in items)
series.str.contains(pattern)
0     True
1    False
2    False
3     True
dtype: bool
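To see why the escaping matters, here is a minimal sketch of the same lookahead pattern using only the standard re module, without pandas. The helper name all_of_pattern and the sample strings are made up for illustration.

```python
import re

def all_of_pattern(items):
    """Build a regex that matches only if every item occurs somewhere in the string."""
    # One lookahead per item; re.escape makes each item match literally.
    return ''.join(f'(?=.*{re.escape(item)})' for item in items)

print(all_of_pattern(['A', 'B', 'C']))

# Items that are regex metacharacters: without re.escape, '.' would match any character.
strings = ['ABC', 'AAABBB', 'a.b+c', 'a+b']
pattern = all_of_pattern(['.', '+'])
results = [bool(re.search(pattern, s)) for s in strings]
print(results)
```

Only 'a.b+c' contains both a literal . and a literal +, so it is the only string that matches. Note that . in .* does not match newlines by default, so multi-line strings would need the re.DOTALL flag.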
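If you are comparing large amounts of data and speed is a concern, rapidfuzz may be of interest.
You can use .cdist() to compare every item against every string in parallel and check that all the scores are greater than 0. With the default scorer, a score above 0 only means the two strings share at least one character, so this works for single-character items like the ones here: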
import rapidfuzz
(rapidfuzz.process.cdist(items, series, workers=-1) > 0).all(axis=0)
array([ True, False, False, True])
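For comparison, the mask from the question can also be built directly from the list, without a regex. This is a sketch of one possible way, not necessarily the fastest: it creates one mask per item and combines them with functools.reduce. Passing regex=False is a choice made here so that each item is treated as a literal substring.

```python
from functools import reduce
import operator

import pandas as pd

series = pd.Series(['ABC', 'AAABBB', 'BBBCCSSS', 'AABBCC'])
items = ['A', 'B', 'C']

# One boolean mask per item; regex=False treats each item as a literal substring.
masks = [series.str.contains(item, regex=False) for item in items]

# AND the masks together, equivalent to mask_A & mask_B & mask_C.
x = reduce(operator.and_, masks)
print(x.tolist())
```

This gives the same result as the chained & expression in the question. reduce raises a TypeError if items is empty, so that case would need to be handled separately.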