Tags: python, python-3.x, regex, cs50, dna-sequence

How to catch the longest sequence of a group


The task is to find the longest run of consecutive repeats of a group.

For instance, the DNA sequence "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC" contains 7 occurrences of AGATC, and the pattern (AGATC) matches each occurrence separately. Is it possible to write a regular expression that matches only the longest run, i.e. AGATCAGATCAGATCAGATCAGATC in the given text? If this is not possible with regex alone, how can I iterate over each run (i.e. the 1st run is AGATCAGATC, the 2nd is AGATCAGATCAGATCAGATCAGATC, and so on) in Python?


Solution

  • Use:

    import re
    
    sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
    matches = re.findall(r'(?:AGATC)+', sequence)
    
    # Pick the longest run of consecutive repeats
    longest = max(matches, key=len)
    

    Explanation:

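    The pattern (?:AGATC)+ is a non-capturing group followed by a quantifier:

    • (?:...) groups AGATC without capturing it, so re.findall returns each whole match rather than only the group's contents.
    • AGATC matches the characters AGATC literally (case sensitive).
    • + matches the group one or more times, as many times as possible (greedy), so each run of consecutive repeats comes back as a single match.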
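
The non-capturing form matters here because re.findall returns the contents of the group, not the whole match, when the pattern contains exactly one capturing group. A small sketch of the difference, using a shortened sequence made up for illustration (it is not the sequence from the question):

```python
import re

# Shortened, made-up sequence for illustration only
sample = "AGATCAGATCTTAGATC"

# Capturing group: findall returns the group's last repetition for each match
capturing = re.findall(r'(AGATC)+', sample)

# Non-capturing group: findall returns each whole run
non_capturing = re.findall(r'(?:AGATC)+', sample)

print(capturing)      # ['AGATC', 'AGATC']
print(non_capturing)  # ['AGATCAGATC', 'AGATC']
```

With the capturing version, the run lengths are lost, so max(..., key=len) could no longer tell the runs apart.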

    Result:

    # print(matches)
    ['AGATCAGATC', 'AGATCAGATCAGATCAGATCAGATC']
    
    # print(longest)
    'AGATCAGATCAGATCAGATCAGATC'
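
For the second part of the question, iterating over each run, re.finditer yields one match object per run, which also gives access to the run's position. The following is a minimal sketch rather than a definitive implementation; the helper name longest_run and its (0, -1) return value for "not found" are assumptions made for illustration.

```python
import re

def longest_run(sequence, unit):
    """Return (repeat_count, start_index) of the longest run of `unit`.

    Hypothetical helper for illustration; returns (0, -1) if `unit` never occurs.
    """
    # re.escape keeps the unit literal even if it contains regex metacharacters
    pattern = re.compile(r'(?:%s)+' % re.escape(unit))
    best = (0, -1)
    for match in pattern.finditer(sequence):
        # Each match is one run; its length divided by the unit length
        # is the number of consecutive repeats
        repeats = len(match.group()) // len(unit)
        if repeats > best[0]:
            best = (repeats, match.start())
    return best

sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"

# Iterate over every run in order of appearance
for match in re.finditer(r'(?:AGATC)+', sequence):
    print(match.start(), match.group())

count, start = longest_run(sequence, "AGATC")
print(count)
```

For the sequence in the question, the longest run is 25 characters long, so the repeat count is 5. Unlike max(matches, key=len), which raises ValueError on an empty list, this sketch returns a sentinel when the unit does not occur at all.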
    
