Search code examples
pythonstringbioinformaticsbiopythondna-sequence

Find position of string (DNA sequence) on reference with mismatches ("N")


I am trying to find start and end positions along a genome alignment for a priming region that is non-contiguous, so essentially there are 2 regions. Here's a simplified example:

genome = "GCTAGCTAGCTAGCTGACGCGATCGATGCTAGTTTCGGTGCGCTtAAAAAAGCTTGCATGAAGGCTAGCTA"
primer = "AGCTGANNNNNTCGATGC"

This would align like this:

alignment of priming region on genome

I need to find the start and ending position on the genome where this priming region sits.

I tried doing this simply as a string by splitting where the N's are like this:

regions = primer.split('N')
start = genome.find(regions[0])
end = genome.find(regions[-1]) + len(regions[-1])

The problem with this is that on a large genome alignment, there will often be repeats of the shorter regions, so I end up with the wrong positions. As far as I can tell, there isn't anything in BioPython that does exactly this, and pairwise2 doesn't have a way to return the start and end positions.

Thank you.


Solution

  • Without regular expression you can solve this task in that way:

    genome = "GCTAGCTAGCTAGCTGACGCGATCGATGCTAGTTTCGGTGCGCTtAAAAAAGCTTGCATGAAGGCTAGCTA"
    primer = "AGCTGANNNNNTCGATGC"
    
    prefix, suffix = primer[:primer.find("N")], primer[primer.rfind("N") + 1:]
    
    start_pos = end_pos = -1
    while True:
        start_pos = genome.find(prefix, start_pos + 1)
        end_pos = start_pos < 0 and start_pos or genome.find(suffix, start_pos + len(prefix)) + len(suffix)
        if start_pos < 0 or end_pos - start_pos == len(primer):
           break
    
    print(start_pos, end_pos)
    

    Using regular expression:

    import re
    ...
    
    pattern = re.compile(primer.replace("N", "."))
    match = pattern.search(genome)
    if match:
        start_pos, end_pos = match.span()
        print(start_pos, end_pos)
    

    To print it in format from question use next code:

    print(genome)
    print(" " * start_pos + "|" * len(prefix) + "." * (len(primer) - len(suffix) - len(prefix)) + "|" * len(suffix))
    print("-" * start_pos + primer + "-" * (len(genome) - end_pos))