Search code examples
pythonbioinformaticssequence-alignment

How to refine a python script for a bioinformatics query


I am quite new to python and I would be grateful for some assistance if possible. I am comparing the genomes of two closely related organisms [E_C & E_F] and trying to identify some basic insertions and deletions. I have run a FASTA pairwise alignment (glsearch36) using sequences from both organisms.

The below is a section of my python script where I have been able to identify a 7 nucleotide (heptamer) in one sequence (database) that corresponds to a gap in the other sequence (query). This is an example of what I have:

ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9
GAGGAAG

Assume the gap is at position 9. I am trying to refine the script to select gaps that are 20 nucleotides or more apart on both sequences and only if the surrounding nucleotides also match

ATGCACAAGTAAGGTTACCG-ACCTGTATGTGAACTCAACA
                 ||| |||
GTGCTCGGGTCACCTTACCGGACCGCCCAGGGCGGCCCAAG
21
CCGGACC

This is the section of my script, the top half deals with opening different files. it also prints a dictionary with the count of each sequence at the end.

list_of_positions = []

for match in re.finditer(r'(?=(%s))' % re.escape("-"), dict_seqs[E_C]): 
    list_of_positions.append(match.start())

set_of_positions = set(list_of_positions)
for position in list_of_positions:
    list_no_indels = []
    for number in range(position-20, position) :
        list_no_indels.append(number)
    for number in range(position+1, position+21) :
        list_no_indels.append(number)
    set_no_indels = set(list_no_indels)
    if len(set_no_indels.intersection(set_of_positions))> 0 : continue

    if len(set_no_indels.intersection(set_of_positions_EF))> 0 : continue


    print position 
    #print match.start()

    print dict_seqs[E_F][position -3:position +3]

    key = dict_seqs[E_F][position -3: position +3]

    if nt_dict.has_key(key):
        nt_dict[key] += 1 
    else:
        nt_dict[key] = 1


print nt_dict

Essentially, I was trying to edit the results of pairwise alignments to try and identify the nucleotides opposite the gaps in both the query and database sequences in order to conduct some basic Insertion/Deletion analysis.

I was able to solve one of my earlier issues by increasing the distance between gaps "-" to 20 nt's in an attempt to reduce noise, this has improved my results. Script edited above.

This is an example of my results and at the end I have a dictionary which counts the occurences of each sequence.

ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9 (position on the sequence)
GAGGAA (hexamer)


ATGCACAAGACCTGTATG # query
ATGCAGAG-AAGAGCAAG # database
9 (position)
CAAGAC (hexamer)

However, I am still trying to fix the script where I get the nucleotides around the gap to match exactly such as this, where the | is just to show matching nt's on each sequence:

GGTTACCG-ACCTGTATGTGAACTCAACA # query
     ||| ||
CCTTACCGGACCGCCCAGGGCGGCCCAAG # database

9
ACCGAC

Any help with this would be gratefully appreciated!


Solution

  • I think I understand what you are trying to do but as @alko has said - comments in your code will definitely help a lot.

    As to finding an exact match around the gap you could run a string comparison:

    Something along the lines of:

    if query[position -3: position] == database[position -3: position] and query[position +1: position +3] == database[position +1: position +3]:
       # Do something
    

    You will need to replace "query" and "database" with what you have called your strings that you want to compare.