python pandas regex list list-comprehension

Fast way to select all elements of a list of strings that contain at least one substring from another list


I have a list of strings like this:

samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003', ...]

The list can be quite long (up to 10^6 elements in the worst case). I have another list containing some substrings:

matches = ['7895_001', '3458_669', '0345_123', ...]

I would like to create a list matched_samples which contains only the elements of samples that contain one or more elements of matches. For example, samples[1] ends up in matched_samples because matches[2] is a substring of samples[1]. I could do something like this:

matched_samples = [s for s in samples if any(xs in s for xs in matches)]

However, this looks like a double for loop, so it's not going to be fast. Is there any alternative? If samples were a pandas DataFrame with a 'sample' column, I could simply do:

matches_regex = '|'.join(matches)
matched_samples = samples[samples['sample'].str.contains(matches_regex)]

Is there a similarly fast alternative with lists?


Solution

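  • You can do the same thing as in your pandas example: join the substrings into one alternation, compile it once, and search each sample with it. Passing each substring through re.escape makes characters such as "." match literally instead of acting as regex syntax.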

    import re
    
    samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003']
    matches = ['7895_001', '3458_669', '0345_123']
    
    # Escape each substring so characters like "." match literally,
    # then join them into a single alternation.
    pattern = re.compile("|".join(re.escape(m) for m in matches))
    
    >>> [s for s in samples if pattern.search(s)]
    ['0345_123_2.09_1.3_003']
    

    If there are no newlines in your samples, you could also join them into a single string and use .findall() with the pattern wrapped as .*(?:...).* so that each match covers a whole line. Without the leading .*, a match would start at the substring and drop whatever precedes it in the sample.

    Not sure if that would make a difference speed-wise.

    >>> alternation = "|".join(re.escape(m) for m in matches)
    >>> pattern = re.compile(f".*(?:{alternation}).*")
    >>> pattern
    re.compile(r'.*(?:7895_001|3458_669|0345_123).*', re.UNICODE)
    >>> pattern.findall("\n".join(samples))
    ['0345_123_2.09_1.3_003']
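
On the speed question, the simplest way to find out is to time both variants on data shaped like yours. Below is a rough sketch, not a benchmark to rely on. The function names filter_any and filter_regex, the make_id helper, and the randomly generated data are all made up for illustration. The early return for an empty matches list is there because an empty pattern matches every string, which would otherwise return all samples.

```python
import random
import re
import timeit


def filter_any(samples, matches):
    """Baseline from the question: nested substring checks."""
    return [s for s in samples if any(m in s for m in matches)]


def filter_regex(samples, matches):
    """One compiled alternation of the escaped substrings."""
    if not matches:
        # "|".join([]) is "", and an empty pattern matches every string,
        # so return early to agree with the any() version.
        return []
    pattern = re.compile("|".join(re.escape(m) for m in matches))
    return [s for s in samples if pattern.search(s)]


def make_id(rng):
    # Hypothetical generator for IDs shaped like '0345_123'.
    return f"{rng.randrange(10_000):04d}_{rng.randrange(1_000):03d}"


# Synthetic data (an assumption, swap in your real lists).
rng = random.Random(0)
matches = [make_id(rng) for _ in range(50)]
samples = [
    f"{make_id(rng)}_1.0_1.35_{rng.randrange(1_000):03d}"
    for _ in range(20_000)
]

# Both approaches should select exactly the same elements.
assert filter_any(samples, matches) == filter_regex(samples, matches)

# Time each approach; the numbers depend on your machine and data.
for fn in (filter_any, filter_regex):
    seconds = timeit.timeit(lambda: fn(samples, matches), number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per call")
```

How the two compare will likely depend on how many substrings are in matches and how long the samples are, so it is worth rerunning this with sizes close to your real data.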