I have some code that takes species names containing underscores from a list, converts them to a format suitable for the NCBI, and then searches for the taxonomy ID associated with each name. For some reason this does not work for every entry in my input file. I have included my code, a subset of the input file, and a subset of the output file.
from Bio import Entrez
import time

Entrez.email = '[email protected]'


def get_tax_id(species):
    # Turn e.g. 'vibrio_lutjanus' into 'vibrio+lutjanus' and search NCBI Taxonomy.
    species = species.replace('_', '+').strip()
    search = Entrez.esearch(term=species, db='taxonomy', retmode='xml')
    record = Entrez.read(search)
    # IdList is empty when NCBI has no match for the name.
    return record['IdList']


current_time = time.strftime("%d.%m.%y %H:%M", time.localtime())
output_name = 'test#%s.txt' % current_time
file = open(output_name, "w+")
# The species name is the first tab-separated column of the input file.
listoforganisms = [x.split('\t')[0] for x in open("OGTlist.csv").readlines()]

if __name__ == '__main__':
    organisms = listoforganisms
    for organism in organisms:
        taxid = get_tax_id(organism)
        stringid = str(taxid)
        strippedid = stringid.strip("'[]'")
        # An empty IdList prints as '[]', i.e. two characters.
        if len(stringid) <= 2:
            file.write('\n' + str(organism) + ',ERROR_no_ID_match')
        else:
            file.write('\n' + str(organism) + ',' + str(strippedid))
    file.close()
This code writes a results file. If the lookup works, it writes the species name and the ID; if not, it writes an error. My results file looks like this:
micromonospora_inyonensis,47866
viola_arvensis,97415
amycolatopsis_albidoflavus,102226
tetragenococcus_koreensis,290335
panaeolus_papilionaceus,330517
geomys_pinetis,100306
vibrio_lutjanus,ERROR_no_ID_match
succiniclasticum_ruminis,40841
microtetraspora_malaysiensis,161358
blarina_carolinensis,183658
amycolatopsis_palatopharyngis,187982
rhodosporidium_toruloides,5286
geobacter_bemidjiensis,225194
acinetobacter_haemolyticus,29430
actinoplanes_tereljensis,571912
phyllostomus_hastatus,9423
phacidium_infestans,66518
dorea_formicigenerans,39486
hoeflea_marina,274592
naemacyclus_minor,64355
methanosaeta_thermophila,2224
pholiota_carbonaria,227966
sphingomonas_faeni,185950
helicobacter_pullorum,35818
solitalea_koreensis,543615
dermacoccus_profundi,322602
pseudomonas_pictorum,86184
actinomadura_livida,79909
leptonycteris_curasoae,55054
psychrobacter_salsus,219741
vibrio_inusitatus,413402
stereum_rameale,ERROR_no_ID_match
photorhabdus_temperata,574560
clitocybe_lignatilis,5634
actinocorallia_glomerata,46203
aspergillus_giganteus,5060
erwinia_amylovora,552
hydrogenoanaerobacterium_saccharovorans,474960
mycobacterium_aichiense,1799
nocardia_pneumoniae,228601
bacillus_pocheonensis,363869
streptomonospora_alba,183763
exobasidium_gracile,190086
phenylobacterium_zucineum,284016
amsonia_tabernaemontana,144544
rattus_fuscipes,10119
jannaschia_rubra,282197
hereroa_rehneltiana,ERROR_no_ID_match
The file I'm getting the species names from looks like this:
micromonospora_inyonensis 28 DSMZ
viola_arvensis 23 DSMZ
amycolatopsis_albidoflavus 28 DSMZ
tetragenococcus_koreensis 28 DSMZ
panaeolus_papilionaceus 24 DSMZ
geomys_pinetis 36.3 white
vibrio_lutjanus 30 DSMZ
succiniclasticum_ruminis 37 DSMZ
microtetraspora_malaysiensis 28 DSMZ
blarina_carolinensis 36.8 white
amycolatopsis_palatopharyngis 28 DSMZ
rhodosporidium_toruloides 23 DSMZ
geobacter_bemidjiensis 30 DSMZ
acinetobacter_haemolyticus 28 DSMZ
actinoplanes_tereljensis 28 DSMZ
phyllostomus_hastatus 34.7 white
phacidium_infestans 25 DSMZ
dorea_formicigenerans 37 DSMZ
hoeflea_marina 28 DSMZ
naemacyclus_minor 22 DSMZ
methanosaeta_thermophila 58.3333333333 DSMZ
pholiota_carbonaria 25 DSMZ
sphingomonas_faeni 22 DSMZ
helicobacter_pullorum 37 DSMZ
solitalea_koreensis 28 DSMZ
dermacoccus_profundi 28 DSMZ
pseudomonas_pictorum 28 DSMZ
actinomadura_livida 28 DSMZ
leptonycteris_curasoae 35.7 white
psychrobacter_salsus 22 DSMZ
vibrio_inusitatus 28 DSMZ
stereum_rameale 20 DSMZ
photorhabdus_temperata 28.6666666667 DSMZ
clitocybe_lignatilis 25 DSMZ
actinocorallia_glomerata 28 DSMZ
aspergillus_giganteus 24.5 DSMZ
erwinia_amylovora 26.6666666667 DSMZ
hydrogenoanaerobacterium_saccharovorans 37 DSMZ
mycobacterium_aichiense 37 DSMZ
nocardia_pneumoniae 28 DSMZ
bacillus_pocheonensis 30 DSMZ
streptomonospora_alba 28 DSMZ
exobasidium_gracile 20 DSMZ
phenylobacterium_zucineum 30 DSMZ
amsonia_tabernaemontana 23 DSMZ
rattus_fuscipes 37.5 white
jannaschia_rubra 25 DSMZ
hereroa_rehneltiana 23 DSMZ
My actual input file has about 2000 entries. Is the answer as simple as the species names being incorrect, or that IDs do not exist on the NCBI for all of these species? Does anyone have a way to overcome this programmatically?
The first answer is that these species names do not exist in the NCBI database. You can check this on the NCBI website, for example here: https://www.ncbi.nlm.nih.gov/search/?term=Stereum+rameale
https://www.ncbi.nlm.nih.gov/search/?term=vibrio_lutjanus
Vibrio lutjanus does not seem to exist anyway if you look at other websites, for example https://www.arb-silva.de/search/
There is no way to overcome this when a taxon ID simply does not exist, but you can double-check whether the name is right. Taxonomy is difficult: different sources use different names, and there are many synonyms. You can use the APIs of taxonomic name services such as GBIF or Global Names.
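As a rough illustration of the name check, here is a minimal sketch that uses only the Python standard library and the GBIF species-match endpoint. The helper names (build_match_url, summarise_match, check_name) are made up for this example, and the response fields read here (matchType, canonicalName, status) are the ones I would expect GBIF to return, so please verify them against the GBIF API documentation before relying on this.

```python
import json
import urllib.parse
import urllib.request

# GBIF name-matching endpoint (assumed; check the GBIF API docs).
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(species):
    """Turn e.g. 'stereum_rameale' into a GBIF species-match request URL."""
    name = species.replace("_", " ").strip()
    return GBIF_MATCH_URL + "?" + urllib.parse.urlencode({"name": name})


def summarise_match(payload):
    """Reduce an already-parsed GBIF response to the fields of interest."""
    return {
        "match_type": payload.get("matchType", "NONE"),
        "canonical_name": payload.get("canonicalName"),
        "status": payload.get("status"),
    }


def check_name(species, timeout=30):
    """Query GBIF for a single name (requires network access)."""
    url = build_match_url(species)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return summarise_match(json.load(response))


# Example usage (not run here because it needs the network):
#   result = check_name("stereum_rameale")
#   print(result["match_type"], result["canonical_name"], result["status"])
```

If a name comes back as a fuzzy match or a synonym, you could retry the NCBI search with the canonical name GBIF suggests instead of writing ERROR_no_ID_match straight away.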
[EDIT]
You can also look up the taxon ID of the genus if the species is not available. You can download the NCBI taxonomy information here:
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/
You need to download the zip file, and you will probably need the files rankedlineage.dmp and merged.dmp. The Global Names website can also be used at genus level. I don't know whether Entrez from Biopython can look up IDs at genus level; maybe that is also an option.
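To make the genus fallback concrete, here is a minimal sketch of a local lookup against the downloaded dump files. It assumes the usual taxdump layout (fields separated by tab, pipe, tab, and each row ending in tab, pipe), that the first two columns of rankedlineage.dmp are the taxon ID and the scientific name, and that merged.dmp maps an old ID to its current ID. All function names are made up for this example. As far as I know rankedlineage.dmp only holds the current scientific names, so synonyms will still be missed.

```python
def parse_dmp_line(line):
    # Taxdump rows use tab-pipe-tab between fields and end with tab-pipe.
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_name_index(rankedlineage_lines):
    """Map lower-cased scientific name -> taxon ID (columns 2 and 1)."""
    index = {}
    for line in rankedlineage_lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 2:
            index[fields[1].lower()] = fields[0]
    return index


def load_merged(merged_lines):
    """Map old taxon ID -> current taxon ID."""
    merged = {}
    for line in merged_lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 2:
            merged[fields[0]] = fields[1]
    return merged


def current_taxid(taxid, merged):
    """Translate an ID from an older results file to its current ID."""
    return merged.get(taxid, taxid)


def lookup(species, name_index):
    """Return (taxid, level): species match first, then genus, else no match."""
    name = species.replace("_", " ").strip().lower()
    taxid = name_index.get(name)
    if taxid is not None:
        return taxid, "species"
    genus = name.split(" ")[0]
    taxid = name_index.get(genus)
    if taxid is not None:
        return taxid, "genus"
    return None, "no_match"


# Example usage, assuming the dump files were unzipped into the working directory:
#   with open("rankedlineage.dmp") as handle:
#       name_index = load_name_index(handle)
#   with open("merged.dmp") as handle:
#       merged = load_merged(handle)
#   print(lookup("vibrio_lutjanus", name_index))
```

Returning the level alongside the ID lets the results file record whether an entry was matched at species or genus level, instead of only ERROR_no_ID_match.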