Search code examples
bashsequencebiopython

How to replace characters at specific positions from a file list?


I have a file containing a sequence:

    >sequence
TAGGACTGAGGGCTGGACAGGGCTGCGGGAG

and another one containing numbers refering to positions:

3
6
11

I would like to get a new file containing 'N' instead of A,C,G,T at the positions defined in the second file such as:

    >sequence
TANGANTGAGNGCTGGACAGGGCTGCGGGAG

Is there a way to do it using bash awk/sed or should I use a python script with SeqIO from biopython?

EDIT:

Here is a start for python script:

from Bio import SeqIO
import sys
import string
unput1=raw_input("enter sequence:")
unput2=raw_input("enter position file:")
fasta_file=unput1
position_file=unput2
result_file="outfile.fasta"
nb_list=list()
with open(position_file) as f:
    for line in f:
        line=line.strip()
        headerline = line.split()
        position=headerline[0]
        position_list.add(position)
for record in SeqIO.parse(StringIO(data), "fasta"):
    if record.id in nb_list:
        seq_record[position_list]="N"
        SeqIO.write([seq_record], f, "fasta")

Solution

  • Using awk with empty FS. This may not work with every awk version or with arbitrarily long sequences:

    $ awk 'BEGIN {
        FS=OFS=""               # process each char as an individual field
    }
    NR==FNR {                   # process the numbers file
        a[$0]                   # hash numbers to a hash
        next
    }
    /^[ACGT]/ {                 # process sequence file
        for(i=1;i<=NF;i++)      # itetate every field
            if(i in a)          # if i found in a
                $i="N"          # replace char with N
    }1' no-file seq-file
    

    Output:

        >sequence
    TANGANTGAGNGCTGGACAGGGCTGCGGGAG