
sed: keep certain contents for matched lines


I have numerous sequences in one FASTA file, like the one below (downloaded from UniProtKB):

>sp|P00045|CYC7_YEAST Cytochrome c iso-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=CYC7 PE=1 SV=1
MAKESTGFKPGSAKKGATLFKTRCQQCHTIEEGGPNKVGPNLHGIFGRHSGQVKGYSYTD
ANINKNVKWDEDSMSEYLTNPKKYIPGTKMAFAGLKKEKDRNDLITYMTKAAK

Since they are all amino acid sequences for cytochrome c, I care only about the organism (i.e. Saccharomyces cerevisiae for the above entry). So I would like to rewrite the headers of these sequences as follows:

>Saccharomyces cerevisiae
MAKESTGFKPGSAKKGATLFKTRCQQCHTIEEGGPNKVGPNLHGIFGRHSGQVKGYSYTD
ANINKNVKWDEDSMSEYLTNPKKYIPGTKMAFAGLKKEKDRNDLITYMTKAAK

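Organism names always come after "OS=" and end at whichever of the following comes first:

  1. a space followed by a parenthesized group, i.e. " (...)" (strain information)
  2. a space followed by a two-character key and "=", i.e. " ..=" (the next field, such as " GN=")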

Could anybody give me some clues on how to do this? Thanks!


Solution

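  • You can use this:

    sed 's/.*OS=\([^(]*\).*/>\1/' input

    The pattern captures everything after "OS=" up to the first "(" and replaces the whole line with ">" followed by the captured text. Sequence lines contain no "OS=", so they pass through unchanged. Two caveats: the capture includes the space before "(", so the new header ends with a trailing space, and a header with no "(" keeps everything after the organism name (the GN=, PE= and SV= fields). In other words, this covers only the first of the two stop conditions in the question.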
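
A possible variant that also handles the second stop condition from the question (a space followed by a two-letter key and "=") and leaves no trailing space is sketched below. It is a sketch under the assumptions stated in the question, not a general UniProtKB header parser: it assumes the organism name itself contains no "(" and no " XX=" token. The function name `organism_headers` is a hypothetical helper introduced only for illustration.

```shell
# Hypothetical helper: rewrite UniProtKB FASTA headers to ">Organism name".
# Steps, applied only to header lines that contain "OS=":
#   1. leave every other line (sequence data, other headers) untouched
#   2. delete everything up to and including "OS="
#   3. cut at " (" (strain information), if present
#   4. cut at " XX=" (next two-letter field such as GN=), if present
#   5. put the leading ">" back
organism_headers() {
  sed -e '/^>.*OS=/!b' \
      -e 's/.*OS=//' \
      -e 's/ *(.*//' \
      -e 's/ [A-Za-z][A-Za-z]=.*//' \
      -e 's/^/>/' "$@"
}
```

Called as `organism_headers input > output.fasta` (with `input` being the file from the command above), this turns the example header into `>Saccharomyces cerevisiae`. Splitting the work into several small substitutions keeps each stop condition in its own expression, which is easier to adjust than one large regular expression.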