python pandas regex list list-comprehension

Fast way to select all the elements of a list of strings, which contain at least a substring from another list

I have a list of strings like this:

samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003', ...]

The list can be quite long (up to 10^6 elements in the worst case). I have another list containing some substrings:

matches = ['7895_001', '3458_669', '0345_123', ...]

I would like to create a list matched_samples which contains only the elements of samples that contain one or more elements of matches. For example, samples[1] ends up in matched_samples because matches[2] is a substring of samples[1]. I could do something like this:

matched_samples = [s for s in samples if any(xs in s for xs in matches)]

However, this is effectively a nested loop over samples and matches, so I doubt it will be fast. Is there any alternative? If samples were a pandas DataFrame, I could simply do:

matches_regex = '|'.join(matches)
matched_samples = samples[samples['sample'].str.contains(matches_regex)]

Is there a similarly fast alternative with lists?

Solution

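You can do the same thing as in your pandas example: join the escaped substrings into one alternation, compile it once, and keep the samples where pattern.search() finds a match.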

import re

samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003']
matches = ['7895_001', '3458_669', '0345_123']

pattern = re.compile("|".join(re.escape(m) for m in matches))

>>> [ s for s in samples if pattern.search(s) ]
['0345_123_2.09_1.3_003']
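One thing to watch for when wrapping this up for reuse: if matches is empty, the joined pattern is the empty string, which matches every sample, whereas the any() version from the question would return nothing. A minimal sketch of a helper that guards against this (the name filter_samples is just an illustration, not something from the question):

```python
import re

def filter_samples(samples, matches):
    """Return the samples that contain at least one substring from matches."""
    if not matches:
        # An empty alternation compiles to '', which matches every string,
        # so handle this case explicitly to mirror any() over an empty list.
        return []
    # re.escape keeps characters such as '.' from being read as regex syntax.
    pattern = re.compile("|".join(re.escape(m) for m in matches))
    return [s for s in samples if pattern.search(s)]

samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003']
matches = ['7895_001', '3458_669', '0345_123']
print(filter_samples(samples, matches))
# ['0345_123_2.09_1.3_003']
```

The re.escape call matters as soon as a substring contains a dot or another metacharacter: without it, 'a.b' would also match 'axb'.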

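If there are no newlines in your samples, you could also join them into a single newline-separated string and use .findall(), wrapping the alternation in (?:...) and appending .* so that each match runs to the end of its line. Note that each result starts at the matched substring, so it equals the full sample only when the match sits at the start of the sample, as it does here.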

Not sure if that would make a difference speed-wise.

>>> pattern = re.compile(f"(?:{'|'.join(re.escape(m) for m in matches)}).*")
>>> pattern
re.compile(r'(?:7895_001|3458_669|0345_123).*', re.UNICODE)
>>> pattern.findall("\n".join(samples))
['0345_123_2.09_1.3_003']
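To settle the speed question on your own data, a rough timing sketch along these lines may help. The data here is randomly generated in a made-up ID format loosely modelled on the question, and the sizes are arbitrary, so treat the numbers as indicative only. The findall variant in this sketch adds ^.* under re.MULTILINE (my own adjustment, not part of the code above) so that it returns whole samples and all three approaches produce the same list.

```python
import random
import re
import timeit

random.seed(0)  # fixed seed so runs are repeatable

def random_id():
    # Hypothetical ID format; the small value ranges make hits likely.
    return f"{random.randrange(100):04d}_{random.randrange(10):03d}"

samples = [f"{random_id()}_1.0_1.35_{random.randrange(1000):03d}" for _ in range(5_000)]
matches = [random_id() for _ in range(50)]

alternation = "|".join(re.escape(m) for m in matches)
pattern = re.compile(alternation)
# '^.*' and '$' with re.MULTILINE make each match cover its whole line.
line_pattern = re.compile(f"^.*(?:{alternation}).*$", re.MULTILINE)
joined = "\n".join(samples)

def with_any():
    return [s for s in samples if any(m in s for m in matches)]

def with_search():
    return [s for s in samples if pattern.search(s)]

def with_findall():
    return line_pattern.findall(joined)

for fn in (with_any, with_search, with_findall):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds:.3f}s for 3 runs, {len(fn())} hits")
```

Which variant wins is likely to depend on the number of substrings and the length of the samples, so it is worth running this with sizes close to your real lists.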