How to shorten the headers in a multi-FASTA file while keeping the sequences, using the Linux command line?


I have a multi-FASTA file named fasta1.fasta that contains sequences and their IDs. I want to trim each header so that it contains only the accession ID of the sequence. I used the command grep '>' fasta1.fasta | cut -d " " -f 1 to cut the part I want from each header, but the output contains only the accession IDs, without the sequences. My sequences look like this:

>tr|Q8IBQ5|Q8IBQ5_PLAF7 40S ribosomal protein S10, putative OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_$
MDKQTLPHHKYSYIPKQNKKLIYEYLFKEGVIVVEKDAKIPRHPHLNVPNLHIMMTLKSL
KSRNYVEEKYNWKHQYFILNNEGIEYLREFLHLPPSIFPATLSKKTVNRAPKMDEDISRD
VRQPMGRGRAFDRRPFE
>tr|Q8IEB1|Q8IEB1_PLAF7 TBC domain protein, putative OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_132020$
MEYKLEFLSYLLIFKKKNERISKFDEQIKTCINIFEKSIINESDLKYLFERNILDMNPGV
RSMCWKLALKHLSLDSNKWNTELIEKKKLYEEYIKSFVINPYYSCVDNKKKEFVKETEKE
PKGKNMKDEYIEYNLDRNKTYYHKDDSLLKLQNDNNTKQMDYLEDEKYSSMDDECSEDNW

The output that I get is:

>tr|Q8IBQ5|Q8IBQ5_PLAF7

>tr|Q8IEB1|Q8IEB1_PLAF7

The desired output is:

>tr|Q8IBQ5|Q8IBQ5_PLAF7
MDKQTLPHHKYSYIPKQNKKLIYEYLFKEGVIVVEKDAKIPRHPHLNVPNLHIMMTLKSL
KSRNYVEEKYNWKHQYFILNNEGIEYLREFLHLPPSIFPATLSKKTVNRAPKMDEDISRD
VRQPMGRGRAFDRRPFE
>tr|Q8IEB1|Q8IEB1_PLAF7
MEYKLEFLSYLLIFKKKNERISKFDEQIKTCINIFEKSIINESDLKYLFERNILDMNPGV
RSMCWKLALKHLSLDSNKWNTELIEKKKLYEEYIKSFVINPYYSCVDNKKKEFVKETEKE
PKGKNMKDEYIEYNLDRNKTYYHKDDSLLKLQNDNNTKQMDYLEDEKYSSMDDECSEDNW

Any help will be appreciated. Thank you.


Solution

    • Variant 1:

      sed '/^>/s/ .*//' fasta1.fasta
      
    • Variant 2:

      perl -pe 's/ .*// if /^>/' fasta1.fasta
      

    That is, on every line that starts with >, remove everything from the first space to the end of the line. Sequence lines do not start with >, so they are printed unchanged.
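
The grep attempt in the question loses the sequences because grep '>' passes only the header lines on to cut. sed reads every line and edits only the headers. The following is a minimal sketch of that behavior, not a definitive recipe. The file names sample.fasta and sample.short.fasta are hypothetical stand-ins so that a real fasta1.fasta is not overwritten, and the headers and sequences are shortened for illustration.

```shell
# Build a small test file (hypothetical name, shortened headers and sequences).
cat > sample.fasta <<'EOF'
>tr|Q8IBQ5|Q8IBQ5_PLAF7 40S ribosomal protein S10, putative
MDKQTLPHHKYSYIPKQNKK
VRQPMGRGRAFDRRPFE
>tr|Q8IEB1|Q8IEB1_PLAF7 TBC domain protein, putative
MEYKLEFLSYLLIFKKKNER
EOF

# On header lines only (/^>/), delete from the first space to the end of the line.
# Sequence lines do not match the address and are written out unchanged.
sed '/^>/s/ .*//' sample.fasta > sample.short.fasta

# Show the result: trimmed headers followed by their untouched sequences.
cat sample.short.fasta
```

Redirecting to a new file leaves the input untouched, which makes it easy to compare the two before replacing anything. With GNU sed, the -i option would edit the file in place instead.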