I have a file with multiple FASTA sequences such as:
File1.fa
>seq1
AAATTTTTATATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA
>seq2
AAATTTTTATATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA
And I have a table such as:
tab
Seq positions
seq1 3:10
seq2 10:20,45:60
For each tab['Seq'], I would like to replace the letters at the corresponding positions of that sequence in File1.fa with X.
As you can see in the second row, a sequence can have multiple ranges to replace; these ranges are separated by a comma in the tab['positions'] column.
Here I should then get a new_File1.fa such as:
>seq1
AAXXXXXXXXATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA
>seq2
AAATTTTTAXXXXXXXXXXXATTACCATTACCATTACCATTACCXXXXXXXXXXXXXXXXTATTATTATTATATACCACACA
where for seq1 positions 3 to 10 are replaced by X, and for seq2 positions 10 to 20 and 45 to 60 are replaced by X.
I guess the Biopython package could be a solution here?
So far I tried the following:
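import re
from Bio import SeqIO
record_dict = SeqIO.to_dict(SeqIO.parse("File1.fa", "fasta"))
for index, row in tab.iterrows():
    # takes the first start and the last end, so it only works for a single "start:end" range
    start = int(re.sub(":.*", "", row['positions']))
    end = int(re.sub(".*:", "", row['positions']))
    # positions are 1-based and inclusive, Python slices are 0-based and half-open
    print(record_dict[row['Seq']].seq[start - 1:end])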
But as you can see, I only manage to extract the part I want to replace with X, and I cannot figure out how to handle multiple ranges within the same sequence.
Convert the sequences to lists, replace your chosen ranges, then convert back to a string. For example,
seq = "AAABBBCCC"
s = list(seq)  # strings are immutable, lists are not
for idx in range(3, 6):  # 0-based indices 3, 4 and 5
    s[idx] = "X"
new_seq = ''.join(s)
print(new_seq)  # AAAXXXCCC
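Applied to the table in the question, the same list-based idea could look like the sketch below. It assumes the positions are 1-based and inclusive (which is what the expected output suggests) and that multiple ranges are separated by commas. The helper names parse_ranges and mask_ranges are hypothetical, not part of Biopython or pandas.

```python
def parse_ranges(positions):
    """Turn a string such as '10:20,45:60' into [(10, 20), (45, 60)]."""
    ranges = []
    for part in positions.split(","):
        start, end = part.split(":")
        ranges.append((int(start), int(end)))
    return ranges


def mask_ranges(sequence, positions, mask_char="X"):
    """Replace every start:end range with mask_char.

    Assumption: positions are 1-based and inclusive.
    """
    chars = list(sequence)
    for start, end in parse_ranges(positions):
        # 1-based inclusive [start, end] -> 0-based indices start-1 .. end-1;
        # min() keeps a range that runs past the end from raising IndexError
        for idx in range(start - 1, min(end, len(chars))):
            chars[idx] = mask_char
    return "".join(chars)


print(mask_ranges("AAABBBCCC", "4:6"))        # AAAXXXCCC
print(mask_ranges("ABCDEFGHIJ", "2:3,8:10"))  # AXXDEFGXXX
```

With Biopython, one way to use this would be to loop over the records from SeqIO.parse, set record.seq = Seq(mask_ranges(str(record.seq), positions)) for each record that has a row in tab, and write the records out with SeqIO.write(records, "new_File1.fa", "fasta").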