Hi there, I've been experimenting with for loops in Bash to edit a FASTA file.
The file has 24 header lines, each starting with the '>' character, as follows:
>CP068277.2
>CP068276.2
>CP068275.2
>CP068274.2
>CP068273.2
>CP068272.2
>CP068271.2
>CP068270.2
>CP068269.2
>CP068268.2
>CP068267.2
>CP068266.2
>CP068265.2
>CP068264.2
>CP068263.2
>CP068262.2
>CP068261.2
>CP068260.2
>CP068259.2
>CP068258.2
>CP068257.2
>CP068256.2
>CP068255.2
>CP086569.2
These are actually chromosomes, and I need the headers to be in the form >chr1, >chr2, etc.
I wrote the following for loop:
for ((c=1; c<=24; c++));
do
sed 's/>/>chr'"$c"' /' CHM13v2.0_no-mito.fna > CHM13v2.0_no-mito_trial.fna;
done
However, the output shows only >chr24 on every header, so the counter is not being applied per header (see below). Does anyone have an idea why?
>chr24 CP068277.2
>chr24 CP068276.2
>chr24 CP068275.2
>chr24 CP068274.2
>chr24 CP068273.2
>chr24 CP068272.2
>chr24 CP068271.2
>chr24 CP068270.2
>chr24 CP068269.2
>chr24 CP068268.2
>chr24 CP068267.2
>chr24 CP068266.2
>chr24 CP068265.2
>chr24 CP068264.2
>chr24 CP068263.2
>chr24 CP068262.2
>chr24 CP068261.2
>chr24 CP068260.2
>chr24 CP068259.2
>chr24 CP068258.2
>chr24 CP068257.2
>chr24 CP068256.2
>chr24 CP068255.2
>chr24 CP086569.2
P.S. Don't worry about the accession IDs following >chr24; I have a way to remove them with sed. Nonetheless, it would be nice to have everything done in one step.
Thanks in advance!
Your loop overwrites the output file on each iteration, so only the result of the last pass (c=24) survives. The syntax for collecting the output of every iteration would be:
for ((c=1; c<=24; c++));
do
sed 's/>/>chr'"$c"' /' CHM13v2.0_no-mito.fna
done > CHM13v2.0_no-mito_trial.fna
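Note, though, that each sed pass still relabels every header with the same number, so this writes 24 relabeled copies of the input instead of numbering the headers 1 to 24. The following numbers the headers in a single pass over the file, so it is far more efficient and doesn't hard-code how many header lines you hope the file contains: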
awk 'sub(/>/,""){$0=">chr" (++c) " " $0} 1' CHM13v2.0_no-mito.fna > CHM13v2.0_no-mito_trial.fna
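To check the behavior before running it on the full assembly, here is a small sketch on a made-up three-record file (sample.fna, numbered.fna and renamed.fna are hypothetical names, and the short sequences are invented). The first command is the same logic as above with the match anchored to the start of the line; the second is an assumed variant that also drops the accession, which may cover the "one step" wish from the P.S.

```shell
# Hypothetical three-record sample; the real input would be CHM13v2.0_no-mito.fna.
printf '>CP068277.2\nACGT\n>CP068276.2\nGGCC\n>CP068275.2\nTTAA\n' > sample.fna

# Number each header and keep the accession after the new label.
# sub() returns 1 only on lines where a leading '>' was removed, i.e. headers.
awk 'sub(/^>/,""){$0=">chr" (++c) " " $0} 1' sample.fna > numbered.fna

# Assumed variant: replace the whole header line, dropping the accession
# in the same pass.
awk '/^>/{$0=">chr" (++c)} 1' sample.fna > renamed.fna

cat numbered.fna
cat renamed.fna
```

In both commands the counter c only advances on header lines, so sequence lines pass through unchanged and the numbering follows the order of the records in the file.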