The task is to find the longest run of consecutive repeats of a group.
For instance, given the DNA sequence "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC",
there are 7 occurrences of AGATC, and the pattern (AGATC) matches every one of them.
Is it possible to write a regular expression that catches only the longest sequence, i.e. AGATCAGATCAGATCAGATCAGATC, in the given text?
If this is not possible with regex alone, how can I iterate through each sequence (i.e. the 1st sequence is AGATCAGATC, the 2nd is AGATCAGATCAGATCAGATCAGATC, et cetera) in Python?
Use:
import re
sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
matches = re.findall(r'(?:AGATC)+', sequence)
# To find the longest subsequence
longest = max(matches, key=len)
Explanation:
Non-capturing group (?:AGATC)+
+ quantifier: matches between one and unlimited times, as many times as possible.
AGATC matches the characters AGATC literally (case sensitive).
Result:
# print(matches)
['AGATCAGATC', 'AGATCAGATCAGATCAGATCAGATC']
# print(longest)
'AGATCAGATCAGATCAGATCAGATC'
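As a small illustration of why the group in the explanation above is non-capturing: when a pattern contains a capturing group, re.findall returns the group's text instead of the whole match, and a repeated group only keeps its last repetition. The variable names below are made up for this sketch.

```python
import re

sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"

# Non-capturing group: findall returns the full text of each match.
non_capturing = re.findall(r'(?:AGATC)+', sequence)

# Capturing group: findall returns only the group's last repetition
# for each match, so the length of the run is lost.
capturing = re.findall(r'(AGATC)+', sequence)

print(non_capturing)  # ['AGATCAGATC', 'AGATCAGATCAGATCAGATCAGATC']
print(capturing)      # ['AGATC', 'AGATC']
```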
You can test the regex here.
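If you also want to iterate over each run, as the question asks, one possible sketch is to use re.finditer, which yields match objects with positions. The names unit, runs and longest_run are only illustrative, and the repeat count assumes every run is made of whole copies of AGATC (which the pattern guarantees).

```python
import re

sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
unit = "AGATC"

runs = []
for m in re.finditer(r'(?:AGATC)+', sequence):
    # Each match is a run of back-to-back copies of the unit.
    repeats = len(m.group()) // len(unit)
    runs.append((m.start(), m.end(), repeats))
    print(f"run at [{m.start()}:{m.end()}] with {repeats} repeats")

# Pick the run with the most repeats (the first one wins on ties).
longest_run = max(runs, key=lambda r: r[2])
print(sequence[longest_run[0]:longest_run[1]])
```

Compared with re.findall, this keeps the start and end offsets of each run, which helps if you need to know where the longest run sits in the sequence.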