Search code examples
awksedwhile-loopfasta

Substitute pattern by an other pattern from two lists using while read line and sed


I try to replace in several files some patterns by other patterns. For example my infile looks like this:

>Genus_species_SRR13259292|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGCTGGGCCTGCGAGACACCAGCACCCCCATCGTGGCCATCACCCTGCACAGCCTCGCCGTGCTGGTCTCCCTGCTCGGACCAGAGGTGGTTGTGGGCGGAGAAAGAACCAAGATCTTCAAACGCACTGCCCCCAGCTTTACAAAAACCACTGACCTCTCCCCAGAAGAC

and I want output:

>Genus_species_Something_something|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGCTGGGCCTGCGAGACACCAGCACCCCCATCGTGGCCATCACCCTGCACAGCCTCGCCGTGCTGGTCTCCCTGCTCGGACCAGAGGTGGTTGTGGGCGGAGAAAGAACCAAGATCTTCAAACGCACTGCCCCCAGCTTTACAAAAACCACTGACCTCTCCCCAGAAGAC

I have two list files, my old patterns:

Genus_species_SRR13259292

and new patterns:

Genus_species_Something_something

I tried to do this with sed. Here is my command:

while IFS= read -r line1 && IFS= read -r line2 <&3; do
    for f in *.fasta; do
        sed -e "s/${line1}/${line2}/g" "$f" > "${f%.fasta}_NewName.fasta"
    done
done < "List_oldpattern.txt" 3<"List_newpatterns.txt"

But this doesn't work, maybe it is because of the > and | delimited the pattern?

If sed doesn't work it may be possible with Awk?


Solution

  • Since the question has been tagged with awk I propose we replace all of OP's current code with a single awk script ...

    My sample .fasta files:

    $ head f?.fasta
    ==> f1.fasta <==
    >Genus_species_SRR13259292|ENSG00000000457_ENST00000367772           # change
    TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
    
    >Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772           # do not change
    TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
    
    ==> f2.fasta <==
    >Genus_species_SRR13259292|ENSG00000000457_ENST00000367772           # change
    TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
    
    >Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772           # do not change
    TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
    

    NOTE: files do not contain the comments

    We'll make use of the paste command to append OP's old and new patterns into a single line; we'll use a | as the delimiter:

    $ paste -d'|' List_oldpattern.txt List_newpatterns.txt
    Genus_species_SRR13259292|Genus_species_Something_something
    

    Now the awk script:

    awk '
    BEGIN     { FS = OFS = "|" }                    # input/output field delimiter
    FNR==NR   { map[">" $1] = ">" $2; next }        # 1st file (paste output): populate our map[] array; $1==old $2==new; then skip to next input line
    FNR==1    { close(outf)                         # 2nd-nth files: 1st record; close previous output file
                outf = FILENAME                     # make copy of input FILENAME
                sub(/\.fasta$/,"",outf)             # strip trailing ".fasta"
                outf = outf "_NewName.fasta"        # append new suffix to our output filename
              }
    $1 in map { $1 = map[$1] }                      # if 1st field (">some_string") is an index in the map[] array then replace 1st field with array contents
              { print > outf }                      # print current line to output file
    
    ' <(paste -d'|' List_oldpattern.txt List_newpatterns.txt) *.fasta
    

    NOTE: assuming OP has more than one old/new pattern pair, this script has the added benefit of only scanning each *.fasta file once (as opposed to OP's current while/read/for/sed loop which scans each .fasta file N times - where N is the number of old/new pattern pairs)

    This generates:

    $ head *_NewName.fasta
    ==> f1_NewName.fasta <==
    >Genus_species_Something_something|ENSG00000000457_ENST00000367772   # changed
    TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
    
    >Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772           # not changed
    TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
    
    ==> f2_NewName.fasta <==
    >Genus_species_Something_something|ENSG00000000457_ENST00000367772   # changed
    TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
    
    >Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772           # not changed
    TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....