Use AWK on multi FASTA file to add new column based on contig header


I have a multi-FASTA file that needs to be parsed so that the Glimmer multi-extract script can process it. It is composed of many contigs, each with its own header that starts with ">". I need to add each header as a new column on the lines that belong to it. The problem is that I don't know much about Bash or awk.

>contig-7
orf00002     1741      461 
orf00003     3381     1747 
>Wcontig-7000023
>Wcontig-11112
orf00001      426     2648 
orf00002     2710     4581 
orf00003     4569     5480 
orf00004     6990     6133 
orf00006     9180     7108 
orf00007    10201     9209 
orf00008    11663    10203 
orf00009    12489    11680 
orf00010    13153    12473 
orf00011    14382    13225 
orf00013    14715    15968 
orf00014    19868    16410 
>Wcontig-1674000002
orf00001     2995      637 
orf00002     2497     1166 
orf00003     2984     2529

I need each contig header added as the first column, followed by a tab delimiter.

>contig-7
>contig-7   orf00002     1741      461 
>contig-7   orf00003     3381     1747 
>Wcontig-7000023
>Wcontig-11112
>Wcontig-11112  orf00001      426     2648 
>Wcontig-11112  orf00002     2710     4581 
>Wcontig-11112  orf00003     4569     5480 
>Wcontig-11112  orf00004     6990     6133 
>Wcontig-11112  orf00006     9180     7108 
>Wcontig-11112  orf00007    10201     9209 
>Wcontig-11112  orf00008    11663    10203 
>Wcontig-11112  orf00009    12489    11680 
>Wcontig-11112  orf00010    13153    12473 
>Wcontig-11112  orf00011    14382    13225 
>Wcontig-11112  orf00013    14715    15968 
>Wcontig-11112  orf00014    19868    16410 
>Wcontig-1674000002
>Wcontig-1674000002 orf00001     2995      637 
>Wcontig-1674000002 orf00002     2497     1166 
>Wcontig-1674000002 orf00003     2984     2529 

Also, after adding the new column I have to remove all the header lines, so the result would look like this:

>contig-7   orf00002     1741      461 
>contig-7   orf00003     3381     1747 
>Wcontig-11112  orf00001      426     2648 
>Wcontig-11112  orf00002     2710     4581 
>Wcontig-11112  orf00003     4569     5480 
>Wcontig-11112  orf00004     6990     6133 
>Wcontig-11112  orf00006     9180     7108 
>Wcontig-11112  orf00007    10201     9209 
>Wcontig-11112  orf00008    11663    10203 
>Wcontig-11112  orf00009    12489    11680 
>Wcontig-11112  orf00010    13153    12473 
>Wcontig-11112  orf00011    14382    13225 
>Wcontig-11112  orf00013    14715    15968 
>Wcontig-11112  orf00014    19868    16410 
>Wcontig-1674000002 orf00001     2995      637 
>Wcontig-1674000002 orf00002     2497     1166 
>Wcontig-1674000002 orf00003     2984     2529 

Solution

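  • awk can do this in a single pass. A line whose first field contains "contig" is a header, so it is stored in c and not printed; every other line is printed with c and a tab in front of it, which also removes the header lines from the output:

    awk '{ if ($1 ~ /contig/) { c = $1 } else { print c "\t" $0 } }' yourfile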
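
A slightly more defensive sketch of the same idea, in case it helps. It assumes every header starts with ">" (as in the sample above) rather than relying on the word "contig", and it skips blank lines instead of printing a lone header and tab for them. The function name add_contig_column and the file names in the usage line are made up for illustration.

```shell
# Hypothetical helper: prefix every ORF line with the header of its contig.
add_contig_column() {
    awk '
        /^>/ { contig = $1; next }      # header line: remember it, print nothing
        NF   { print contig "\t" $0 }   # non-empty line: header, a tab, then the original line
    ' "$@"
}
```

Usage would be something like `add_contig_column glimmer.predict > glimmer.tabbed`. Headers that have no ORF lines under them (such as >Wcontig-7000023 in the sample) simply produce no output, which matches the desired result in the question.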