python biopython fasta

Add X's within certain positions of a multifasta file


I have a file with multiple FASTA sequences such as:

File1.fa

>seq1
AAATTTTTATATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA
>seq2
AAATTTTTATATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA

And I have a table such as:

tab

Seq positions 
seq1 3:10
seq2 10:20,45:60

For each row of tab, I would like to replace the letters of the sequence named in tab['Seq'] with X at the corresponding positions in File1.fa.

As the second row shows, a sequence can have several ranges to replace; these are separated by , in the tab['positions'] column.

Here I should then get a new_File1.fa such as:

>seq1
AAXXXXXXXXATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA

>seq2
AAATTTTTAXXXXXXXXXXXATTACCATTACCATTACCATTACCXXXXXXXXXXXXXXXXTATTATTATTATATACCACACA

where for seq1 positions 3 to 10 are replaced with X, and for seq2 positions 10 to 20 and positions 45 to 60 are replaced with X.

I guess the Biopython package could be a solution here?

So far I tried the following:

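from Bio import SeqIO
import re

record_dict = SeqIO.to_dict(SeqIO.parse("File1.fa", "fasta"))

for index, row in tab.iterrows():
    # Correct only for rows holding a single "start:end" range
    start = int(re.sub(":.*", "", row['positions']))
    end = int(re.sub(".*:", "", row['positions']))
    # Positions are 1-based and inclusive, hence start - 1
    print(record_dict[row['Seq']].seq[start - 1:end])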

But as you can see I only manage to extract the part I want to replace with X and I cannot figure out how to take into account when there are multiple positions to replace in the sequence.


Solution

  • Python strings are immutable, so convert each sequence to a list, replace the chosen ranges, then convert it back to a string. For example,

    seq = "AAABBBCCC"
    s = list(seq)
    
    for idx in range(3, 6):
        s[idx] = "X"
    
    new_seq = ''.join(s)    
    print(new_seq)
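
The same idea can be extended to the table from the question. The following is only a minimal sketch, not a definitive implementation. It assumes the positions are 1-based and inclusive (which is what the expected output in the question suggests) and that the table has been turned into a plain dict mapping sequence IDs to position strings. The names parse_ranges, mask_ranges and mask_fasta are made up for this illustration. Only mask_fasta needs Biopython, and it imports it lazily so the rest runs without it.

```python
def parse_ranges(positions):
    """Turn a string such as "10:20,45:60" into [(10, 20), (45, 60)]."""
    ranges = []
    for part in positions.split(","):
        start, end = part.strip().split(":")
        ranges.append((int(start), int(end)))
    return ranges


def mask_ranges(seq, positions, mask_char="X"):
    """Replace every 1-based, inclusive range in `positions` with mask_char."""
    chars = list(seq)  # strings are immutable, so edit a list instead
    for start, end in parse_ranges(positions):
        # 1-based inclusive start:end corresponds to 0-based indices
        # start-1 .. end-1; ranges running past the end are clipped.
        for idx in range(start - 1, min(end, len(chars))):
            chars[idx] = mask_char
    return "".join(chars)


def mask_fasta(in_path, out_path, table):
    """Write a masked copy of a FASTA file.

    `table` is assumed to be a dict like {"seq1": "3:10"}.
    Requires Biopython, imported here so the helpers above work without it.
    """
    from Bio import SeqIO
    from Bio.Seq import Seq

    records = []
    for record in SeqIO.parse(in_path, "fasta"):
        if record.id in table:
            record.seq = Seq(mask_ranges(str(record.seq), table[record.id]))
        records.append(record)
    SeqIO.write(records, out_path, "fasta")


# First 16 letters of seq1 from the question, masked at positions 3 to 10
print(mask_ranges("AAATTTTTATATACCC", "3:10"))  # AAXXXXXXXXATACCC
```

With a pandas DataFrame like the one in the question, the dict could be built with dict(zip(tab['Seq'], tab['positions'])) and passed to mask_fasta("File1.fa", "new_File1.fa", table). Note that SeqIO.write wraps sequence lines in "fasta" output; if one line per sequence is needed, the "fasta-2line" format should avoid the wrapping.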