I have numerous sequences in one FASTA file, like the one below (downloaded from UniProtKB):
>sp|P00045|CYC7_YEAST Cytochrome c iso-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=CYC7 PE=1 SV=1
MAKESTGFKPGSAKKGATLFKTRCQQCHTIEEGGPNKVGPNLHGIFGRHSGQVKGYSYTD
ANINKNVKWDEDSMSEYLTNPKKYIPGTKMAFAGLKKEKDRNDLITYMTKAAK
Since they are all amino acid sequences for cytochrome c, I only care about the organism (i.e. Saccharomyces cerevisiae for the entry above). So I would like to change the headers of these sequences to look like this:
>Saccharomyces cerevisiae
MAKESTGFKPGSAKKGATLFKTRCQQCHTIEEGGPNKVGPNLHGIFGRHSGQVKGYSYTD
ANINKNVKWDEDSMSEYLTNPKKYIPGTKMAFAGLKKEKDRNDLITYMTKAAK
Organism names always come after "OS=" and stop when either one of:
is met.
Could anybody give me some clues on how to do this? Thanks!
You can use this:
sed 's/.*OS=\([^(]*[^( ]\).*/>\1/' input
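The one-liner above stops only at an opening parenthesis, so a header with no "(strain ...)" part would keep everything after the organism name. Here is a rough sketch of a variant that also stops at the next two-letter field. It assumes that such a field (for example " GN=") always follows the organism name when there is no parenthesis; that is my guess at the header layout, not something confirmed in the question. The function name fasta_organism_headers is made up for illustration.

```shell
# Hypothetical helper: rewrite UniProtKB-style FASTA headers to ">Organism".
# Reads from the files given as arguments, or from stdin if none are given.
fasta_organism_headers() {
  # Only touch header lines that actually contain "OS=".
  sed '/^>.*OS=/ {
    # Drop everything up to and including "OS=".
    s/^>.*OS=/>/
    # Cut at the first "(" together with any spaces before it.
    s/ *(.*$//
    # Assumption: otherwise the name ends at the next " XX=" field, e.g. " GN=".
    s/  *[A-Z][A-Z]=.*$//
  }' "$@"
}

# Demo on the entry from the question.
printf '%s\n' \
  '>sp|P00045|CYC7_YEAST Cytochrome c iso-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=CYC7 PE=1 SV=1' \
  'MAKESTGFKPGSAKKGATLFKTRCQQCHTIEEGGPNKVGPNLHGIFGRHSGQVKGYSYTD' \
  'ANINKNVKWDEDSMSEYLTNPKKYIPGTKMAFAGLKKEKDRNDLITYMTKAAK' |
  fasta_organism_headers
```

To run it on a real file, something like fasta_organism_headers input > output should work. Sequence lines pass through unchanged because they never match the header address.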