Rename multiple headers in a FASTA file to leave only the numbers


I have a fasta file with multiple headers:

     >CABITT030000001.1 genome assembly, contig: 1, whole genome shotgun sequence
    
     >CABITT030000002.1 genome assembly, contig: 2, whole genome shotgun sequence

...

I would like to keep only the number in each header (1, 2, and so on), taken either from the accession (CABITT03000000*.1) or from the number after the contig: string.

Output:

>1
>2

I tried it with the sed command below, but it doesn't work:

sed 's/>.*/>1/' fasta.fa > newfasta.fa

Solution

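  • Going on the example input you provided, this should work:

    sed -e 's/^>.* contig: \([[:digit:]][[:digit:]]*\).*/>\1/' fasta.fa
    >1
    >2

    This uses a character class for digits ([[:digit:]]) and a capture group (\( \)) that the replacement references as \1. Repeating the class ([[:digit:]][[:digit:]]*) captures the whole number, so contig: 10 becomes >10 rather than >1, and the leading ^> restricts the substitution to header lines.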
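
For a quick check of the approach on a file that also contains sequence lines and a contig number with more than one digit, here is a minimal sketch. The helper name rename_headers, the file names sample.fa and renamed.fa, and the sequence lines are assumptions made up for illustration; they are not from the question.

```shell
# Hypothetical helper: rewrite each FASTA header to ">" plus its contig number.
# Lines that do not start with ">" (the sequence data) pass through unchanged.
rename_headers() {
    sed -e 's/^>.* contig: \([[:digit:]][[:digit:]]*\).*/>\1/' "$1"
}

# Small sample input; the sequence lines are invented for this example.
printf '%s\n' \
    '>CABITT030000001.1 genome assembly, contig: 1, whole genome shotgun sequence' \
    'ACGTACGT' \
    '>CABITT030000012.1 genome assembly, contig: 12, whole genome shotgun sequence' \
    'TTGACCAA' > sample.fa

# Write the renamed records to a new file and show the result.
rename_headers sample.fa > renamed.fa
cat renamed.fa
```

With this sample the headers become >1 and >12, and the two sequence lines are left as they were. Writing to a new file rather than editing in place keeps the original available for comparison.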