python awk bioinformatics fasta

Change fasta file header


I have a few FASTA files and I want to change the headers:

>XP_001267680.1 conserved hypothetical protein [Aspergillus clavatus NRRL 1]
MTEILARLTAPSAYRYASCEILEDYGRQLRELIAYIKQPRTTADIATAAEFLLDNLDPSLHSASY...
>XP_001267682.1 60S ribosomal protein L18 [Aspergillus clavatus NRRL 1]
MGIDLDRHHVRSTHRKAPKSENVYLQVLVKLYRFLSRRTESNFNKVVLRRLFMSRINRPPVS...
etc...

And I want to change the fasta file so it looks like this:

>Acla00001
MTEILARLTAPSAYRYASCEILEDYGRQLRELIAYIKQPRTTADIATAAEFLLDNLDPSLHSASY...
>Acla00002
MGIDLDRHHVRSTHRKAPKSENVYLQVLVKLYRFLSRRTESNFNKVVLRRLFMSRINRPPVS...
...
>Acla03871
MTEILARLTAPSAYRYASCEILEDYGRQLRELIAYIKQPRTTADIATAAEFLLDNLDPSLHSASYLF...
>Acla03872
MGIDLDRHHVRSTHRKAPKSENVYLQVLVKLYRFLSRRTESNFNKVVLRRLFMSRINRPPVSL...

I found this piece of code, which replaces every line starting with > by a new > followed by the organism name and a running number.

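import os
org = 'Acla'
# Replace each header with ">" + org + a running counter; print all other lines unchanged
os.popen("""cat %s.fa | awk '/^>/{print ">%s" ++i; next}{print}'""" % (org, org)).read()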

I want all of these header lines to have the same length, so the number should be padded with leading zeros to 5 digits, which makes each header 10 characters long including the >.


Solution

  • Change the print statement to printf with a %05d format, which pads the counter with leading zeros to five digits:

     /^>/{printf ">Acla%05d\n",++i ...
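
For comparison, the same renumbering can be sketched in Python alone, without shelling out to awk. This is a minimal sketch, not the answerer's method: the function name renumber_fasta_headers and its width parameter are made up for illustration, and the sequences in the sample records are shortened from the question.

```python
def renumber_fasta_headers(lines, prefix, width=5):
    """Yield FASTA lines, replacing each header with >prefix plus a zero-padded counter.

    Hypothetical helper for illustration; `width` is the number of digits.
    """
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            # With width=5 this pads the same way as awk's %05d
            yield f">{prefix}{count:0{width}d}\n"
        else:
            # Sequence lines pass through unchanged
            yield line


# Sample records based on the question, with the sequences shortened
records = [
    ">XP_001267680.1 conserved hypothetical protein [Aspergillus clavatus NRRL 1]\n",
    "MTEILARLTAPSAYRYASC\n",
    ">XP_001267682.1 60S ribosomal protein L18 [Aspergillus clavatus NRRL 1]\n",
    "MGIDLDRHHVRSTHRKAPK\n",
]
print("".join(renumber_fasta_headers(records, "Acla")), end="")
```

To process a real file, the open file object can be passed as lines and the results written to a new file. Doing it in Python avoids the shell quoting needed to put the organism name into the awk program.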