awk sed fasta

Trim FASTA headers with sed


I have a reference genome whose headers (lines starting with >) look like the ones below, and I would like to rename them to just the chromosome number or letter. I would like a sed command that does this replacement systematically, but I am new to sed. The file also contains other headers that should stay unchanged, and the sequences between the headers should stay unchanged as well.

>ST078050.1 Ovis aries is a sheep chromosome 1, whole genome shotgun sequence
>ST078051.1 Ovis aries is a sheep chromosome 2, whole genome shotgun sequence
>ST078052.1 Ovis aries is a sheep chromosome 3, whole genome shotgun sequence
>ST078053.1 Ovis aries is a sheep chromosome 4, whole genome shotgun sequence
>ST078054.1 Ovis aries is a sheep chromosome 5, whole genome shotgun sequence
>ST078055.1 Ovis aries is a sheep chromosome 6, whole genome shotgun sequence
>ST078056.1 Ovis aries is a sheep chromosome 7, whole genome shotgun sequence
>ST078057.1 Ovis aries is a sheep chromosome 8, whole genome shotgun sequence
>ST078058.1 Ovis aries is a sheep chromosome 9, whole genome shotgun sequence
>ST078059.1 Ovis aries is a sheep chromosome 10, whole genome shotgun sequence
>ST078079.1 Ovis aries is a sheep chromosome X, whole genome shotgun sequence
>ST078080.1 Ovis aries is a sheep chromosome Y, whole genome shotgun sequence

Output should be:

>1
>2
>3
>4
>5
>6
>7
>8
>9
>10
>X
>Y

I tried the following, but it's not right.

sed 's/^.*\(chromosome.*,\).*$/\1/' file

Thank you!


Solution

  • Assuming the lines above are only some of the headers of an actual FASTA file, and the sequences are still present in that file, either of the following will do the job:

    $ sed '/^>/{s/,.*//;s/^.* />/}' file.fasta
    $ awk '/^>/{sub(/,.*$/,"");$0=">"$NF}1' file.fasta
    

    Both commands do exactly the same thing. On each line that starts with >, they remove everything from the first comma to the end of the line, and then replace everything up to and including the last space with >. In awk, the second step is done simply by taking the last field ($NF).
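The commands above rewrite every header that contains a space or a comma, while the question asks that other headers stay untouched. As a minimal sketch, assuming the headers to rename always contain the word "chromosome" followed by the name and a comma, the substitution can be restricted to those lines. The function name rename_chr_headers and the "unplaced scaffold" header in the sample are made up for illustration.

```shell
# Hypothetical helper: rewrite only headers of the form
#   ">... chromosome NAME, ..."   ->   ">NAME"
# Every other line (other headers, sequence lines) is passed through as is.
# Reads the file(s) given as arguments, or stdin when none are given.
rename_chr_headers() {
    sed 's/^>.* chromosome \([^ ,][^ ,]*\),.*$/>\1/' "$@"
}

# Small made-up sample: two chromosome headers and one invented
# header that should be left unchanged.
rename_chr_headers <<'EOF'
>ST078050.1 Ovis aries is a sheep chromosome 1, whole genome shotgun sequence
ACGTACGT
>ST078079.1 Ovis aries is a sheep chromosome X, whole genome shotgun sequence
GGCCTTAA
>ST099999.1 Ovis aries unplaced scaffold, whole genome shotgun sequence
TTTTAAAA
EOF
```

On this sample the first and third lines become >1 and >X, and the remaining four lines are printed unchanged. To process a real file, pass it as an argument and redirect the result to a new file, for example rename_chr_headers file.fasta > renamed.fasta.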