I want to extract articles matching a key term within a specific period using Biopython. However, no matter how large a value I pass for "retmax", I get at most 9999 IDs back. When I run the same search on PubMed manually, the number of articles found is very different, well above 9999.
from Bio import Entrez
from io import StringIO
import csv
# Set email and API key (replace 'youremail' and 'yourapikey' with your own)
Entrez.email = 'banana@gmail' #just dummy
Entrez.api_key = 'banana' #just dummy
# Define search terms and time window
search_term = 'cancer'
start_date = '2015/01/01'
end_date = '2016/08/30'
# Define search query
query = f'{search_term}[Title/Abstract] AND ("{start_date}"[Date - Publication] : "{end_date}"[Date - Publication])'
# Search PUBMED using the query
handle = Entrez.esearch(db='pubmed', term=query, retmax=10000)
record = Entrez.read(handle)
id_list = record['IdList']
print(len(id_list))  # never exceeds 9999, whatever retmax is set to
Please help me get the exact number of articles from PubMed using Biopython.
This is what the NCBI ESearch documentation, which is linked from the Biopython esearch documentation, has to say about increasing retmax to more than 10,000 records in PubMed:
retmax
Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. If usehistory is set to 'y', the remainder of the retrieved set will be stored on the History server; otherwise these UIDs are lost. Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records. To retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved.
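If what you need is only the exact number of matching articles, the parsed ESearch result carries it in its "Count" field, independently of how many IDs end up in "IdList". Here is a minimal sketch of reading both values; the helper names summarize_search and count_pubmed_hits and the placeholder email are made up for illustration and are not part of Biopython.

```python
def summarize_search(record):
    """Return (total_matches, ids_returned) from a parsed ESearch record."""
    total = int(record["Count"])      # every article matching the query
    returned = len(record["IdList"])  # limited by retmax and the 10,000 cap
    return total, returned


def count_pubmed_hits(query):
    """Run the query against PubMed and summarize the result (needs network)."""
    # Imported here so summarize_search() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = "you@example.com"  # placeholder, replace with your address

    handle = Entrez.esearch(db="pubmed", term=query)
    record = Entrez.read(handle)
    handle.close()
    return summarize_search(record)
```

Calling count_pubmed_hits(query) with the query string from the question should give a total that matches what the PubMed website reports, while the second value stays at or below the number of IDs actually returned.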
You can try following section "9.16.2 Searching for and downloading abstracts using the history" in the Biopython cookbook to get all the records in chunks of 10,000 using the retstart and retmax arguments.
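A rough sketch of that history-based approach, modeled on the cookbook section, could look like the code below. The function names batch_offsets and fetch_in_batches and the placeholder email are my own choices for illustration. Whether NCBI actually serves PubMed records past the first 10,000 this way is an assumption I have not verified; the documentation quoted above points to EDirect for that case.

```python
def batch_offsets(total, batch_size=10000):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_in_batches(query, batch_size=10000):
    """Search once with the History server, then download in batches (needs network)."""
    # Imported here so batch_offsets() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = "you@example.com"  # placeholder, replace with your address

    # usehistory="y" stores the full result set on the History server.
    handle = Entrez.esearch(db="pubmed", term=query, usehistory="y")
    search = Entrez.read(handle)
    handle.close()

    total = int(search["Count"])
    chunks = []
    for retstart, retmax in batch_offsets(total, batch_size):
        handle = Entrez.efetch(
            db="pubmed",
            rettype="medline",
            retmode="text",
            retstart=retstart,
            retmax=retmax,
            webenv=search["WebEnv"],
            query_key=search["QueryKey"],
        )
        chunks.append(handle.read())  # one block of MEDLINE-formatted text
        handle.close()
    return total, chunks
```

For example, list(batch_offsets(25000)) gives [(0, 10000), (10000, 10000), (20000, 5000)], so three efetch calls would be made for 25,000 matches.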