bash sed fasta

fasta file: replace header with filename


I want to replace every header (a line starting with >) with >{filename} in all *.fasta files in my directory, and then concatenate the files.

Contents of my directory:

speciesA.fasta
speciesB.fasta
speciesC.fasta

Example file, speciesA.fasta:

>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL

My desired output (shown for speciesA.fasta only):

>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL

This is my code:

for file in *.fasta; do var=$(basename $file .fasta) | sed 's/>.*/>$var/' $var.fasta >>$var.outfile.fasta; done

but all I get is

>$var
MJSUNDKFJSKFJSKFJ
>$var
KEFJKSDJFKSDJFKSJFLSJDFLKSJF

[and so on ...]

Where did I make a mistake?


Solution

  • The bash loop is superfluous. Try:

    awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta
    

    This approach is safe even if the file names contain special or regex-active characters.

    How it works

    • /^>/ {print ">" substr(FILENAME, 1, length(FILENAME)-6); next}

      For any line that begins with >, the commands in curly braces are executed. The first command prints > followed by all but the last 6 characters of the filename, which removes the .fasta extension. The second command, next, skips the remaining rules and moves on to the next input line.

    • 1

      This is awk's cryptic shorthand for print-the-line: a pattern that is always true and has no action, so awk applies its default action, print.
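
The two rules described above can also be written out in longer form. The following is only a sketch, not part of the original answer: the function name relabel_headers is hypothetical, and it assumes every input file name ends in .fasta. Unlike the one-liner, it strips a leading directory part, so it also works when the files are given with a path.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original answer): replace every header
# line with ">label", where label is the file name without its directory
# and without the .fasta extension. The result goes to standard output.
relabel_headers() {
    awk '
        # On the first line of each input file, derive the label from FILENAME.
        FNR == 1 {
            label = FILENAME
            sub(/.*\//, "", label)      # drop any leading directory part
            sub(/\.fasta$/, "", label)  # drop the .fasta extension
        }
        /^>/ { print ">" label; next }  # header line: print the label instead
        { print }                       # any other line: print unchanged (same as "1")
    ' "$@"
}
```

It would be called as relabel_headers *.fasta, or with paths such as relabel_headers data/*.fasta.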

    Example

    Let's consider a directory with two (identical) test files:

    $ cat speciesA.fasta
    >protein1 description
    MJSUNDKFJSKFJSKFJ
    >protein2 anothername
    KEFJKSDJFKSDJFKSJFLSJDFLKSJF
    >protein3 somewordshere
    KSDAFJLASDJFKLAJFL
    $ cat speciesB.fasta
    >protein1 description
    MJSUNDKFJSKFJSKFJ
    >protein2 anothername
    KEFJKSDJFKSDJFKSJFLSJDFLKSJF
    >protein3 somewordshere
    KSDAFJLASDJFKLAJFL
    

    The output of our command is:

    $ awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta
    >speciesA
    MJSUNDKFJSKFJSKFJ
    >speciesA
    KEFJKSDJFKSDJFKSJFLSJDFLKSJF
    >speciesA
    KSDAFJLASDJFKLAJFL
    >speciesB
    MJSUNDKFJSKFJSKFJ
    >speciesB
    KEFJKSDJFKSDJFKSJFLSJDFLKSJF
    >speciesB
    KSDAFJLASDJFKLAJFL
    

    The output contains the substitutions, and all the input files are concatenated.
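
For comparison, here is a sketch of how the loop from the question could be repaired while keeping sed. Two things go wrong in the original: the single quotes around the sed script stop the shell from expanding $var, so sed receives the literal text $var, and the | after the assignment puts it in a pipeline, where it runs in a subshell and the variable never reaches sed. The function name relabel_with_sed is hypothetical, and the sketch assumes the file names contain no /, & or \ characters, which sed would treat specially in the replacement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: a repaired version of the loop from the question.
# Prints the relabelled, concatenated FASTA data on standard output.
relabel_with_sed() {
    local file name
    for file in "$@"; do
        # Plain assignment, no pipe: the variable stays in the current shell.
        name=$(basename "$file" .fasta)   # e.g. speciesA.fasta -> speciesA
        # Double quotes let the shell expand $name before sed sees the script.
        # Assumption: the name has no "/", "&" or "\" in it.
        sed "s/^>.*/>$name/" "$file"
    done
}
```

A call such as relabel_with_sed *.fasta > combined.fa collects everything in one file. The output name deliberately does not end in .fasta, so a later run of *.fasta does not pick it up as input.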