
(Biopython) How can I increase the number of articles returned by Entrez.esearch(retmax=N)?


I want to extract articles matching a key term within a specific period using Biopython. However, no matter how large a value I pass as "retmax", at most 9999 IDs are returned. When I run the same search on PubMed manually, the number of articles found is far greater than 9999.

from Bio import Entrez
from io import StringIO
import csv

# Set email and API key (replace the dummy values below with your own)
Entrez.email = 'banana@gmail' #just dummy
Entrez.api_key = 'banana' #just dummy

# Define search terms and time window
search_term = 'cancer'
start_date = '2015/01/01'
end_date = '2016/08/30'

# Define search query
query = f'{search_term}[Title/Abstract] AND ("{start_date}"[Date - Publication] : "{end_date}"[Date - Publication])'

# Search PUBMED using the query
handle = Entrez.esearch(db='pubmed', term=query, retmax=10000)
record = Entrez.read(handle)
id_list = record['IdList']
len(id_list)

How can I get the exact number of articles that match the query in PubMed using Biopython?


Solution

  • This is what the NCBI ESearch documentation, which is linked from the Biopython esearch documentation, says about raising retmax above 10,000 records in PubMed:

    retmax

    Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. If usehistory is set to 'y', the remainder of the retrieved set will be stored on the History server; otherwise these UIDs are lost. Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records.

    To retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved.
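One possible workaround for the per-query cap quoted above, not taken from the NCBI text, is to split the publication-date range into shorter windows so that each ESearch call stays under 10,000 hits, and to compare the `Count` field of each result with the number of IDs actually returned. The sketch below is a minimal illustration under those assumptions. The helper names `split_date_range`, `build_query`, `search_window` and `collect_ids`, the constant `RETMAX`, and the window length are all hypothetical choices, and `Entrez.email` must be set before any real search is run.

```python
from datetime import date, timedelta

RETMAX = 10_000  # per-query maximum stated in the ESearch documentation


def split_date_range(start, end, days):
    """Split [start, end] into consecutive, non-overlapping windows of at most `days` days."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


def build_query(term, window):
    """Build the same kind of query string as in the question, for one date window."""
    first, last = window
    return (f'{term}[Title/Abstract] AND '
            f'("{first:%Y/%m/%d}"[Date - Publication] : '
            f'"{last:%Y/%m/%d}"[Date - Publication])')


def search_window(term, window):
    """Run one ESearch call and return (total_hits, id_list). Needs network access."""
    # Imported lazily so the pure helpers above can be used without Biopython.
    from Bio import Entrez

    handle = Entrez.esearch(db="pubmed", term=build_query(term, window), retmax=RETMAX)
    record = Entrez.read(handle)
    handle.close()
    # "Count" is the total number of matches; "IdList" holds only the IDs returned.
    return int(record["Count"]), list(record["IdList"])


def collect_ids(term, windows, search=search_window):
    """Gather IDs over all windows; fail loudly if a window was truncated by the cap."""
    ids = set()  # a set, in case a record matches more than one window
    for window in windows:
        total, window_ids = search(term, window)
        if total > len(window_ids):
            raise ValueError(f"window {window} has {total} hits; use a shorter window")
        ids.update(window_ids)
    return ids
```

With the dates from the question, a call might look like `collect_ids('cancer', split_date_range(date(2015, 1, 1), date(2016, 8, 30), days=7))`. The 7-day window is only a guess; if any window still exceeds the cap, the `ValueError` signals that a shorter window is needed.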

    You can try the approach in section "9.16.2 Searching for and downloading abstracts using the history" of the Biopython cookbook to get the records in chunks of up to 10,000 using the retstart and retmax arguments.
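The history-based pattern from that cookbook section can be sketched roughly as follows: search once with `usehistory="y"`, then pass the returned `WebEnv` and `QueryKey` to `Entrez.efetch` while stepping `retstart`. This is a minimal sketch, not a verified fix. The function names `batch_offsets` and `fetch_with_history` and the batch size are hypothetical, `Entrez.email` is assumed to be set beforehand, and the documentation quoted above does not confirm that PubMed will serve offsets beyond the first 10,000 records this way.

```python
def batch_offsets(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_with_history(query, batch_size=500):
    """Search once on the History server, then download records batch by batch.

    Returns (total_hits, medline_text). Needs network access.
    """
    # Imported lazily so batch_offsets can be used without Biopython installed.
    from Bio import Entrez

    handle = Entrez.esearch(db="pubmed", term=query, usehistory="y")
    search = Entrez.read(handle)
    handle.close()

    total = int(search["Count"])  # total matches, independent of retmax
    chunks = []
    for retstart, retmax in batch_offsets(total, batch_size):
        handle = Entrez.efetch(
            db="pubmed",
            rettype="medline",
            retmode="text",
            retstart=retstart,
            retmax=retmax,
            webenv=search["WebEnv"],
            query_key=search["QueryKey"],
        )
        chunks.append(handle.read())
        handle.close()
    return total, "".join(chunks)
```

For example, `list(batch_offsets(25, 10))` gives `[(0, 10), (10, 10), (20, 5)]`, so the last request asks only for the records that remain.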