I am quite new to python and I would be grateful for some assistance if possible. I am comparing the genomes of two closely related organisms [E_C & E_F] and trying to identify some basic insertions and deletions. I have run a FASTA pairwise alignment (glsearch36) using sequences from both organisms.
The below is a section of my python script where I have been able to identify a 7 nucleotide (heptamer) in one sequence (database) that corresponds to a gap in the other sequence (query). This is an example of what I have:
ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9
GAGGAAG
Assume the gap is at position 9. I am trying to refine the script to select gaps that are 20 nucleotides or more apart on both sequences and only if the surrounding nucleotides also match
ATGCACAAGTAAGGTTACCG-ACCTGTATGTGAACTCAACA
||| |||
GTGCTCGGGTCACCTTACCGGACCGCCCAGGGCGGCCCAAG
21
CCGGACC
This is the section of my script, the top half deals with opening different files. it also prints a dictionary with the count of each sequence at the end.
list_of_positions = []
for match in re.finditer(r'(?=(%s))' % re.escape("-"), dict_seqs[E_C]):
list_of_positions.append(match.start())
set_of_positions = set(list_of_positions)
for position in list_of_positions:
list_no_indels = []
for number in range(position-20, position) :
list_no_indels.append(number)
for number in range(position+1, position+21) :
list_no_indels.append(number)
set_no_indels = set(list_no_indels)
if len(set_no_indels.intersection(set_of_positions))> 0 : continue
if len(set_no_indels.intersection(set_of_positions_EF))> 0 : continue
print position
#print match.start()
print dict_seqs[E_F][position -3:position +3]
key = dict_seqs[E_F][position -3: position +3]
if nt_dict.has_key(key):
nt_dict[key] += 1
else:
nt_dict[key] = 1
print nt_dict
Essentially, I was trying to edit the results of pairwise alignments to try and identify the nucleotides opposite the gaps in both the query and database sequences in order to conduct some basic Insertion/Deletion analysis.
I was able to solve one of my earlier issues by increasing the distance between gaps "-" to 20 nt's in an attempt to reduce noise, this has improved my results. Script edited above.
This is an example of my results and at the end I have a dictionary which counts the occurences of each sequence.
ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9 (position on the sequence)
GAGGAA (hexamer)
ATGCACAAGACCTGTATG # query
ATGCAGAG-AAGAGCAAG # database
9 (position)
CAAGAC (hexamer)
However, I am still trying to fix the script where I get the nucleotides around the gap to match exactly such as this, where the | is just to show matching nt's on each sequence:
GGTTACCG-ACCTGTATGTGAACTCAACA # query
||| ||
CCTTACCGGACCGCCCAGGGCGGCCCAAG # database
9
ACCGAC
Any help with this would be gratefully appreciated!
I think I understand what you are trying to do but as @alko has said - comments in your code will definitely help a lot.
As to finding an exact match around the gap you could run a string comparison:
Something along the lines of:
if query[position -3: position] == database[position -3: position] and query[position +1: position +3] == database[position +1: position +3]:
# Do something
You will need to replace "query" and "database" with what you have called your strings that you want to compare.