I have a list of strings like this:
samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003', ...]
The list can be quite long (up to 10^6 elements in the worst case). I have another list containing some substrings:
matches = ['7895_001', '3458_669', '0345_123', ...]
I would like to create a list matched_samples which contains only the elements of samples that contain one or more elements of matches. For example, samples[1] ends up in matched_samples because matches[2] is a substring of samples[1]. I could do something like this:
matched_samples = [s for s in samples if any(xs in s for xs in matches)]
However, this is effectively a double for loop, so it's not going to be fast. Is there any alternative? If samples were a pandas DataFrame, I could simply do:
matches_regex = '|'.join(matches)
matched_samples = samples[samples['sample'].str.contains(matches_regex)]
Is there a similarly fast alternative with lists?
You can do the same thing as in your pandas example.
import re
samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003']
matches = ['7895_001', '3458_669', '0345_123']
pattern = re.compile("|".join(re.escape(m) for m in matches))
>>> [s for s in samples if pattern.search(s)]
['0345_123_2.09_1.3_003']
If there are no newlines in your samples, you could also join them into a single newline-separated string and use .findall() with the pattern wrapped in (?:...).* so each match extends to the end of its line. Not sure if that would make a difference speed-wise.
>>> pattern = re.compile(f"""(?:{"|".join(re.escape(m) for m in matches)}).*""")
>>> pattern
re.compile(r'(?:7895_001|3458_669|0345_123).*', re.UNICODE)
>>> pattern.findall("\n".join(samples))
['0345_123_2.09_1.3_003']
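For completeness, a minimal end-to-end sketch (using the sample data from the question) showing that both approaches agree. Note the .findall() variant starts each returned string at the matched substring, not at the start of the line; with this data the substring happens to be a prefix, so the results coincide.

```python
import re

samples = ['2345_234_1.0_1.35_001', '0345_123_2.09_1.3_003']
matches = ['7895_001', '3458_669', '0345_123']

# Compile the alternation once; re.escape guards against regex metacharacters.
pattern = re.compile("|".join(re.escape(m) for m in matches))

# Approach 1: filter the list with one compiled search per sample.
matched_samples = [s for s in samples if pattern.search(s)]

# Approach 2: join into one string and extract matching lines with findall.
line_pattern = re.compile(f"(?:{pattern.pattern}).*")
matched_via_findall = line_pattern.findall("\n".join(samples))

print(matched_samples)       # ['0345_123_2.09_1.3_003']
print(matched_via_findall)   # same result here
```

If speed matters, the two variants are easy to compare with timeit on your real data; which one wins can depend on how long samples is and how many alternatives matches contains.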