unix sed bioinformatics fasta

Use sed to delete everything after '>' and add index number plus a string?


I know this should be pretty simple to do, but I can't get it to work. My file looks like this:

>c12345|random info goes here that I want to delete
AAAAATTTTTTTTCCCC
>c45678| more | random info|  here
GGGGGGGGGGG

What I want to do is make the header lines far simpler, so the file looks like this:

>seq1 [organism=human]
AAAAATTTTTTTTCCCC
>seq2 [organism=human]
GGGGGGGGGGG
>seq3 [organism=human]
etc....

I know I can easily append that constant once the indexed part is in place, by doing:

sed '/^>/ s/$/ [organism=human]/'

But how do I get that index built?


Solution

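  • With sed:

    sed '/^>/d' filename | sed '=' | sed 's/^[0-9]*$/>seq& [organism=human]/'

    The first sed deletes the original header lines, the second prints a line number before each remaining line, and the third turns each of those numbers into a new header. This relies on every record having exactly one sequence line and on the file containing no blank lines; otherwise the numbers will not line up with the records.

    (Thanks to NeronLeVelu for the simplification.)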
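
  • With awk (a sketch that is not part of the original answer): the sed pipeline above numbers sequence lines, so it depends on each record having a single sequence line. If sequences may be wrapped over several lines, one option is to count the header lines themselves. The function name renumber_fasta below is hypothetical, and the demo input is the sample from the question with the first sequence split in two to show the wrapped case.

```shell
# Hypothetical helper: replace every FASTA header with ">seqN [organism=human]",
# where N counts header lines (not sequence lines).
renumber_fasta() {
    awk '
        /^>/ { n++; print ">seq" n " [organism=human]"; next }  # header: emit new one
        { print }                                               # sequence: pass through
    ' "$@"
}

# Demo: reads stdin when no file name is given.
printf '%s\n' \
    '>c12345|random info goes here that I want to delete' \
    'AAAAATTTT' \
    'TTTTCCCC' \
    '>c45678| more | random info|  here' \
    'GGGGGGGGGGG' | renumber_fasta
```

    For a real file, something like renumber_fasta filename > renamed.fasta should work; the output file name is only an example.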