I have a reference genome containing the following headers (lines starting with >) that I would like to be renamed to simply the digit/letter of the chromosomes. I would like a sed statement to do this systematic replacement, but I am new to sed. Elsewhere in the file are additional headers that should be unchanged, and the genetic sequences between the headers should remain unchanged.
>ST078050.1 Ovis aries is a sheep chromosome 1, whole genome shotgun sequence
>ST078051.1 Ovis aries is a sheep chromosome 2, whole genome shotgun sequence
>ST078052.1 Ovis aries is a sheep chromosome 3, whole genome shotgun sequence
>ST078053.1 Ovis aries is a sheep chromosome 4, whole genome shotgun sequence
>ST078054.1 Ovis aries is a sheep chromosome 5, whole genome shotgun sequence
>ST078055.1 Ovis aries is a sheep chromosome 6, whole genome shotgun sequence
>ST078056.1 Ovis aries is a sheep chromosome 7, whole genome shotgun sequence
>ST078057.1 Ovis aries is a sheep chromosome 8, whole genome shotgun sequence
>ST078058.1 Ovis aries is a sheep chromosome 9, whole genome shotgun sequence
>ST078059.1 Ovis aries is a sheep chromosome 10, whole genome shotgun sequence
>ST078079.1 Ovis aries is a sheep chromosome X, whole genome shotgun sequence
>ST078080.1 Ovis aries is a sheep chromosome Y, whole genome shotgun sequence
Output should be:
>1
>2
>3
>4
>5
>6
>7
>8
>9
>10
>X
>Y
I tried the following, but it's not right.
sed 's/^.*\(chromosome.*,\).*$/\1/' file
Thank you!
Assuming that the above are just some headers of actual fasta files, and the remaining sequence is still in the files, then the following solutions will do the job:
$ sed '/^>/{s/,.*//;s/^.* />/}' file.fasta
$ awk '/^>/{sub(/,.*$/,"");$0=">"$NF}1' file.fasta
Both methods do exactly the same. In the line that starts with a >
, remove the string starting with a ,
till the end and replace everything upto the last space with a >
. The latter is done in awk by simple calling the last field.