bash loops awk header fasta

Add filename to FASTA headers in a loop with awk?


I know this has been asked before, but I cannot find a solution that works. For some reason, none of the other solutions posted on Stack Overflow work for me.

I have a directory with 900+ FASTA files, all ending in ".faa". Some of the names are:

TLLD001.faa TLLD002.faa TLLD003.faa TLLD004.faa TLLD005.faa

etc etc

Within each file, the FASTA headers look like:

   >scaffold4567
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >scaffold0034
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ

or

   >NODE_212
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >NODE_86667
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ

etc etc

I want to go through all the files and prepend the filename (without the extension) to every header. For example, in TLLD001.faa

   >scaffold4567
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >scaffold0034
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
   >scaffold7667
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >scaffold6778
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ

should become

   >TLLD001_scaffold4567
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >TLLD001_scaffold0034
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ
   >TLLD001_scaffold7667
   WRVLSTSFNGIKYEQSAAFAMIPSTT
   >TLLD001_scaffold6778
   EQSAAFAMIPSTTSISWRVLSTSFNGIKYEQ

This works nicely, but I have to specify a single file every time:

$ awk '/>/{sub(">","&"FILENAME"_");sub(/\.faa/,x)}1' TLLD001.faa

so not my cup of tea

This seemed to work on the 3-4 files I used as a test, but it will not work in my directory of 900+ files (it takes forever):

for i in *.faa; do 
    sed -i "s/^>/>${i}_/g" *.faa
done

And the following do not work at all:

$ for file in *.fasta; do awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < $file > "`basename $file .fasta`_single-line.fasta"; done

and

$ for file in *.faa; do awk '/>/{sub(">","&"${file}"_");sub(/\.faa/,x)}1' < $file > "`basename $file .faa`_mod.faa"; done

and I don't know why! Any help, and an explanation of how to use this almighty but cryptic "awk", will be highly appreciated.

thanks P


Solution

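  • The sed solution is the way to go, but you repeated the glob inside the loop body!

    Instead of

    for f in *.faa; do sed -i "s/^>/>${f%.faa}_/g" *.faa; done

    pass the loop variable "${f}" to sed as the file argument. Otherwise the glob is expanded again on every iteration, so each pass rewrites all 900+ files, and every header ends up with one prefix per file in the directory. That is also why it takes forever.

    for f in *.faa; do sed -i "s/^>/>${f%.faa}_/g" "${f}"; done

    I also made use of bash parameter expansion: ${f%.faa} strips the .faa suffix from the filename, and the trailing underscore gives the TLLD001_scaffold4567 form you asked for.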
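
Since the question also asks about awk: the last awk attempt fails because ${file} sits inside single quotes, so the shell never expands it and awk receives it literally. Below is a minimal sketch of a loop version that hands the name to awk with -v instead. It runs in a throwaway directory created with mktemp so that no real data is touched; the names demo_dir, base and prefix are illustrative and not from the original post.

```shell
# Scratch directory with two small sample files (illustrative data only).
demo_dir=$(mktemp -d)
printf '>scaffold4567\nWRVLSTSFNGIKYEQSAAFAMIPSTT\n>scaffold0034\nEQSAAFAMIPSTT\n' > "$demo_dir/TLLD001.faa"
printf '>NODE_212\nWRVLSTSFNGIKYEQ\n' > "$demo_dir/TLLD002.faa"

for f in "$demo_dir"/*.faa; do
    base=$(basename "$f" .faa)   # TLLD001.faa -> TLLD001
    # -v copies the shell value into the awk variable "prefix".
    # On header lines, replace the leading ">" with ">prefix_"; the trailing 1 prints every line.
    awk -v prefix="$base" '/^>/ { sub(/^>/, ">" prefix "_") } 1' "$f" > "$f.tmp" &&
        mv "$f.tmp" "$f"         # awk cannot edit in place, so write a temp file and swap it in
done

head -n 1 "$demo_dir/TLLD001.faa"
```

On your real data you would drop the mktemp and printf lines and loop over *.faa in the directory itself, ideally after trying it on a copy first.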