python-3.x pandas contains

Iterating the Series.str.contains function with an AND condition


I have the following series:

series = pd.Series(['ABC', 'AAABBB', 'BBBCCSSS', 'AABBCC'])

and I would like to check whether each string contains A, B and C, like so:

x = (series.str.contains('A')) & (series.str.contains('B')) & (series.str.contains('C'))

However, I would like to drive the arguments to contains() from a list such as items = ['A', 'B', 'C'] or items = ['D', 'E', 'F', 'G'], which would then build the variable x above.

How could I create the variable x iteratively using the list?


Solution

  • This is not the most efficient approach, so if you're comparing large amounts of data it may not be a workable solution, but it is possible to build a single regex.

    series = pd.Series(['ABC', 'AAABBB', 'BBBCCSSS', 'AABBCC'])
    items = ['A', 'B', 'C']
    
    • One way to express AND in a regex is to chain lookahead assertions, (?=...), one per item.

    • re.escape escapes regex metacharacters (e.g. ., +) in each item.

    import re
    
    # Build one lookahead per item; re.escape protects any metacharacters.
    pattern = ''.join(f'(?=.*{re.escape(item)})' for item in items)
    
    # pattern is now: (?=.*A)(?=.*B)(?=.*C)
    
    series.str.contains(pattern)
    
    0     True
    1    False
    2    False
    3     True
    dtype: bool
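
For comparison, the expression from the question can also be built directly from the list by combining one str.contains call per item. This is only a minimal sketch: it assumes literal (non-regex) matching is wanted, and the names masks and mask are introduced here for illustration.

```python
from functools import reduce
import operator

import pandas as pd

series = pd.Series(['ABC', 'AAABBB', 'BBBCCSSS', 'AABBCC'])
items = ['A', 'B', 'C']

# One boolean Series per item; regex=False treats each item as a literal substring.
masks = (series.str.contains(item, regex=False) for item in items)

# AND the masks together, equivalent to mask_A & mask_B & mask_C.
mask = reduce(operator.and_, masks)

print(mask.tolist())  # [True, False, False, True]
```

Note that reduce raises a TypeError when items is empty, so an empty list would need to be handled separately.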
    

    rapidfuzz

    If you are comparing large amounts of data and speed is a concern, rapidfuzz may be of interest.

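    You can use process.cdist() to score every item against every string in parallel and check that all the scores are greater than 0. With the default scorer, a nonzero score only means the two strings share at least one character, so this acts as a containment test for single-character items only: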

    (rapidfuzz.process.cdist(items, series, workers=-1) > 0).all(axis=0)
    
    array([ True, False, False,  True])
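
If the items can be longer than one character, a plain substring check is another option that does not depend on similarity scores. The following is a rough sketch, not a benchmarked solution: it assumes the series holds no missing values, and the two-character item 'AB' and the name contains_all are made up for illustration.

```python
import pandas as pd

series = pd.Series(['ABC', 'AAABBB', 'BBBCCSSS', 'AABBCC'])
items = ['AB', 'C']  # hypothetical multi-character items

# For each string, require every item to occur as a literal substring.
contains_all = pd.Series(
    [all(item in text for item in items) for text in series],
    index=series.index,
)

print(contains_all.tolist())  # [True, False, False, True]
```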