I have a .fasta file (essentially a .txt file) of about 145,000 entries, formatted as below:
>gi|393182|gb|AAA40101.1| cytokine [Mus musculus]
MDAKVVAVLALVLAALCISDGKPVSLSYRCPCRFFESHIARANVKHLKILNTPNCALQIVARLKNNNRQV
CIDPKLKWIQEYLEKALNKRLKM
>gi|378792467|pdb|3UNH|Y Chain Y, Mouse 20s Immunoproteasome
TTTLAFKFQHGVIVAVDSRATAGSYISSLRMNKVIEINPYLLGTMSGCAADCQYWERLLAKECRLYYLRN
GERISVSAASKLLSNMMLQYRGMGLSMGSMICGWDKKGPGLYYVDDNGTRLSGQMFSTGSGNTYAYGVMD
SGYRQDLSPEEAYDLGRRAIAYATHRDNYSGGVVNMYHMKEDGWVKVESSDVSDLLYKYGEAAL
>gi|378792462|pdb|3UNH|T Chain T, Mouse 20s Immunoproteasome
MSSIGTGYDLSASTFSPDGRVFQVEYAMKAVENSSTAIGIRCKDGVVFGVEKLVLSKLYEEGSNKRLFNV
DRHVGMAVAGLLADARSLADIAREEASNFRSNFGYNIPLKHLADRVAMYVHAYTLYSAVRPFGCSFMLGS
YSANDGAQLYMIDPSGVSYGYWGCAIGKARQAAKTEIEKLQMKEMTCRDVVKEVAKIIYIVHDEVKDKAF
ELELSWVGELTKGRHEIVPKDIREEAEKYAKESLKEEDESDDDNM
I have been using various Biopython parsing bits and pieces, but I think it fails because of the size of the search. I was hoping someone here would know of a more efficient way?
Thanks in advance!
Rather than parsing the not entirely consistent FASTA header line for the species, you could just extract the GI number and then look up the NCBI taxonomy ID; see e.g. http://lists.open-bio.org/pipermail/biopython/2009-June/005304.html. From the taxid you can get the species name, common name, lineage, etc.; see ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt. If you prefer an online solution, the Entrez Utilities (EUtils) are another option.
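To make the offline route concrete, here is a rough sketch in plain Python of how it might look. It streams the FASTA file line by line, so 145,000 entries should not be a memory problem. The function names are my own invention. I am also assuming that you have a GI-to-taxid mapping file with two whitespace-separated columns (GI, then taxid), as the gi_taxid_prot.dmp dump used to provide, and a names.dmp file from the taxdump archive with fields separated by `|`. Check both against the files you actually download, since NCBI may no longer distribute the GI-based mapping.

```python
import re

# Matches headers such as ">gi|393182|gb|AAA40101.1| cytokine [Mus musculus]"
GI_PATTERN = re.compile(r"^>gi\|(\d+)\|")


def iter_gi_numbers(lines):
    """Yield the GI number (as int) from each FASTA header that carries one."""
    for line in lines:
        if line.startswith(">"):
            match = GI_PATTERN.match(line)
            if match:
                yield int(match.group(1))


def load_gi_to_taxid(lines, wanted):
    """Return {gi: taxid} for the GIs in `wanted`.

    Assumption: each line holds two whitespace-separated integers,
    the GI number followed by the taxonomy ID.
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        gi = int(parts[0])
        if gi in wanted:
            mapping[gi] = int(parts[1])
    return mapping


def load_scientific_names(lines):
    """Return {taxid: scientific name} from names.dmp-style lines.

    Assumption: fields are tax_id | name | unique name | name class |,
    and only rows whose name class is "scientific name" are kept.
    """
    names = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names
```

All three functions accept any iterable of lines, so you can pass open file handles directly, e.g. `set(iter_gi_numbers(open("proteins.fasta")))` (the file name is a placeholder). Collecting the wanted GIs into a set first means the large mapping file is read only once and only the entries you need are kept in memory.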