Iterating the Series.str.contains function with an AND condition

I have the following series:

series = pd.Series(['ABC', 'AAABBB', 'BBBCCSSS', 'AABBCC'])

and I would like to check whether each string contains A, B and C, for example:

x = (series.str.contains('A')) & (series.str.contains('B')) & (series.str.contains('C'))

However, I would like to drive the contains() calls from a list such as items = ['A', 'B', 'C'] or items = ['D', 'E', 'F', 'G'], which would then build the variable x above.

How could I create the variable x iteratively using the list?

Solution

This is not the most efficient approach, so it may not be workable if you're comparing large amounts of data, but it is possible to build a single regex from the list.

series = pd.Series(['ABC', 'AAABBB', 'BBBCCSSS', 'AABBCC'])
items = ['A', 'B', 'C']

One way to express AND in a regex is to chain lookahead assertions, (?=...), one per item.
Use re.escape to escape any regex metacharacters in the items (e.g. ".", "+"):

''.join(f'(?=.*{re.escape(item)})' for item in items)

(?=.*A)(?=.*B)(?=.*C)
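To see why the lookaheads act as an AND and why the escaping matters, here is a small sketch using plain re. The helper name all_of_pattern and the sample items 'a+' and '.' are made up for illustration; they are not part of the question.

```python
import re

def all_of_pattern(items):
    # One lookahead per item: each must find its item somewhere in the
    # string, and all of them are anchored at the same starting position.
    return ''.join(f'(?=.*{re.escape(item)})' for item in items)

items = ['a+', '.']  # items that contain regex metacharacters

escaped = all_of_pattern(items)
unescaped = ''.join(f'(?=.*{item})' for item in items)

# 'x.a+' contains both the literal 'a+' and the literal '.'
has_both = bool(re.search(escaped, 'x.a+'))

# 'aaa.' has no literal 'a+', so the escaped pattern rejects it ...
escaped_hit = bool(re.search(escaped, 'aaa.'))
# ... while the unescaped one reads 'a+' as "one or more a" and accepts it
unescaped_hit = bool(re.search(unescaped, 'aaa.'))

print(escaped)
print(has_both, escaped_hit, unescaped_hit)
```

Each lookahead consumes no characters, so the order of the items in the list does not affect the result.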

import re

pattern = ''.join(f'(?=.*{re.escape(item)})' for item in items)

series.str.contains(pattern)

0     True
1    False
2    False
3     True
dtype: bool
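If you would rather build x literally the way the question writes it, one mask per item combined with &, a minimal sketch could use functools.reduce. This is an alternative to the single-regex approach, not a replacement for it; regex=False is an assumption that the items should be treated as plain substrings.

```python
from functools import reduce
import operator

import pandas as pd

series = pd.Series(['ABC', 'AAABBB', 'BBBCCSSS', 'AABBCC'])
items = ['A', 'B', 'C']

# One boolean mask per item; regex=False treats each item as a literal substring
masks = (series.str.contains(item, regex=False) for item in items)

# Fold the masks together with &, equivalent to mask_A & mask_B & mask_C
x = reduce(operator.and_, masks)

print(x.tolist())
```

Note that reduce raises a TypeError when items is empty, so an empty list needs separate handling if it can occur.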

rapidfuzz

If you are comparing large amounts of data and speed is a concern, rapidfuzz may be of interest.

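You can use process.cdist() to score every item against every string in parallel and check that all the scores are greater than 0. With the default scorer, a nonzero score only means the two strings share at least one character, so this is equivalent to a containment check only for single-character items like the ones here: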

import rapidfuzz

(rapidfuzz.process.cdist(items, series, workers=-1) > 0).all(axis=0)

array([ True, False, False,  True])