Tags: python, search, biopython, large-data

Biopython (or just Python in general): Most efficient way to parse species names from a large .fasta file using gi identifiers


I have a .fasta file (essentially .txt) of about 145,000 entries that are formatted as below:

>gi|393182|gb|AAA40101.1| cytokine [Mus musculus]
MDAKVVAVLALVLAALCISDGKPVSLSYRCPCRFFESHIARANVKHLKILNTPNCALQIVARLKNNNRQV
CIDPKLKWIQEYLEKALNKRLKM

>gi|378792467|pdb|3UNH|Y Chain Y, Mouse 20s Immunoproteasome
TTTLAFKFQHGVIVAVDSRATAGSYISSLRMNKVIEINPYLLGTMSGCAADCQYWERLLAKECRLYYLRN
GERISVSAASKLLSNMMLQYRGMGLSMGSMICGWDKKGPGLYYVDDNGTRLSGQMFSTGSGNTYAYGVMD
SGYRQDLSPEEAYDLGRRAIAYATHRDNYSGGVVNMYHMKEDGWVKVESSDVSDLLYKYGEAAL

>gi|378792462|pdb|3UNH|T Chain T, Mouse 20s Immunoproteasome
MSSIGTGYDLSASTFSPDGRVFQVEYAMKAVENSSTAIGIRCKDGVVFGVEKLVLSKLYEEGSNKRLFNV
DRHVGMAVAGLLADARSLADIAREEASNFRSNFGYNIPLKHLADRVAMYVHAYTLYSAVRPFGCSFMLGS
YSANDGAQLYMIDPSGVSYGYWGCAIGKARQAAKTEIEKLQMKEMTCRDVVKEVAKIIYIVHDEVKDKAF
ELELSWVGELTKGRHEIVPKDIREEAEKYAKESLKEEDESDDDNM
  1. I have a list of gi numbers (the first number listed after the |).
  2. The size of this list varies between 60 and 600 gi numbers for a given test.
  3. I want to return a list with the respective species of those gi numbers.
  4. The species name usually appears as in the first example (surrounded by square brackets: [Mus musculus]), but it is not always present.
  5. Order is not particularly important.
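
For reference, the task in the list above can be sketched as a single streaming pass over the file in plain Python, so the 145,000 entries are never held in memory at once. This is only a rough illustration: the function name and the two regular expressions are hypothetical, and it assumes every header of interest starts with `>gi|<number>|` and that the species, when present, is the last bracketed item on the header line.

```python
import re

# Hypothetical patterns based only on the example headers shown above.
GI_RE = re.compile(r"^>gi\|(\d+)\|")
SPECIES_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def species_for_gis(lines, wanted_gis):
    """Map each wanted gi (as a string) to its bracketed species, or None."""
    wanted = {str(gi) for gi in wanted_gis}  # set gives O(1) membership tests
    found = {}
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line: skip without any regex work
        gi_match = GI_RE.match(line)
        if gi_match is None or gi_match.group(1) not in wanted:
            continue
        species_match = SPECIES_RE.search(line)
        found[gi_match.group(1)] = species_match.group(1) if species_match else None
        if len(found) == len(wanted):
            break  # every requested gi has been seen, stop reading
    return found
```

It could be called with an open file handle, for example `with open("proteins.fasta") as handle: result = species_for_gis(handle, my_gis)`, where the file name and `my_gis` are placeholders. Headers without a bracketed name come back as `None`, which matches point 4.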

I have been using various Biopython parsing bits and pieces, but I think they fail because of the size of the search. I was hoping someone on here would know of a more efficient way?

Thanks in advance!


Solution

  • Rather than parsing the not entirely consistent FASTA header line for the species, you could just extract the GI number and then look up the NCBI taxonomy ID (e.g. see http://lists.open-bio.org/pipermail/biopython/2009-June/005304.html). From the taxid you can get the species name, common name, lineage, etc.; see ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt. If you prefer an online solution, the Entrez Utilities (EUtils) are another option.
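
A minimal offline sketch of that GI to taxid to name lookup might look like the following. It rests on two assumptions that should be checked against the taxdump readme: that a two-column, whitespace-separated GI-to-taxid mapping file is available locally, and that the names file uses `|`-delimited fields with the taxid first, the name second, and the name class (for example `scientific name`) fourth. Both function names are hypothetical.

```python
def taxids_for_gis(mapping_lines, wanted_gis):
    """Collect gi -> taxid for the wanted GIs from 'gi<whitespace>taxid' lines."""
    wanted = {str(gi) for gi in wanted_gis}
    result = {}
    for line in mapping_lines:
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            result[parts[0]] = parts[1]
            if len(result) == len(wanted):
                break  # all requested GIs resolved, stop streaming
    return result


def scientific_names(names_lines, taxids):
    """Collect taxid -> scientific name from '|'-delimited names lines."""
    wanted = set(taxids)
    names = {}
    for line in names_lines:
        # Assumed layout: taxid | name | unique name | name class |
        fields = [field.strip() for field in line.split("|")]
        if (
            len(fields) >= 4
            and fields[0] in wanted
            and fields[3] == "scientific name"
        ):
            names[fields[0]] = fields[1]
    return names
```

Both functions accept any iterable of lines, so open file handles can be passed directly and each file is read only once. With 60 to 600 GIs per test, the lookups stay small even if the mapping file itself is very large.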