
Biopython's ESearch does not give me full IdList


I am trying to search for some articles by using the following code:

handle = Entrez.esearch(db="pubmed", term="lung+cancer")
record = Entrez.read(handle)

From record['Count'] I can see there are 293279 results, but record['IdList'] contains only 20 IDs. Why is that, and how do I get all 293279 records?


Solution

  • By default, Entrez.esearch returns only the first 20 IDs. This is to prevent overloading NCBI's servers. To get the full list of IDs, set the retmax parameter to the total count:

    >>> from Bio import Entrez
    >>> Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are
    >>> handle = Entrez.esearch(db="pubmed", term="lung+cancer")
    >>> record = Entrez.read(handle)
    >>> count = record['Count']
    >>> handle = Entrez.esearch(db="pubmed", term="lung+cancer", retmax=count)
    >>> record = Entrez.read(handle)
    >>> print(len(record['IdList']))
    293279
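
    If a single request cannot return every ID (NCBI's E-utilities documentation describes an upper limit on retmax), the IDs can be requested in pages using the retstart parameter together with retmax. The sketch below only computes the page boundaries and makes no network calls; the helper name page_windows and the page size of 10000 are assumptions for illustration.

```python
def page_windows(total, page_size):
    """Yield (retstart, retmax) pairs that cover `total` records in order."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for retstart in range(0, total, page_size):
        # The last page may be shorter than page_size
        yield retstart, min(page_size, total - retstart)


# 293279 is the count from the question; 10000 is an assumed page size
windows = list(page_windows(293279, 10000))
print(len(windows))   # 30
print(windows[0])     # (0, 10000)
print(windows[-1])    # (290000, 3279)
```

    Each pair could then be passed as retstart=... and retmax=... to Entrez.esearch, and the resulting IdList values concatenated.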
    

    That gives you the IDs. To download the records themselves, upload the IDs with Entrez.epost and then fetch them in batches.

    From chapter 9.4 of the Biopython tutorial:

    EPost uploads a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through the Bio.Entrez.epost() function.

    To give an example of when this is useful, suppose you have a long list of IDs you want to download using EFetch (maybe sequences, maybe citations – anything). When you make a request with EFetch your list of IDs, the database etc, are all turned into a long URL sent to the server. If your list of IDs is long, this URL gets long, and long URLs can break (e.g. some proxies don’t cope well).

    Instead, you can break this up into two steps, first uploading the list of IDs using EPost (this uses an “HTML post” internally, rather than an “HTML get”, getting round the long URL problem). With the history support, you can then refer to this long list of IDs, and download the associated data with EFetch.

    [...] The returned XML includes two important strings, QueryKey and WebEnv which together define your history session. You would extract these values for use with another Entrez call such as EFetch.
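
    To get a feel for the long-URL problem the tutorial describes, here is a rough sketch that builds the GET URL an EFetch request with many IDs would produce and measures it. The ID values are made up, and the URL is built by hand with the standard library rather than by Biopython, so treat the exact length as illustrative only.

```python
from urllib.parse import urlencode

# Hypothetical ID list: 10,000 made-up eight-digit PubMed IDs
ids = [str(10000000 + i) for i in range(10000)]

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
query = urlencode({"db": "pubmed", "id": ",".join(ids)})

# With GET, every ID ends up in the URL itself
get_url = base + "?" + query
print(len(get_url))

# With POST, the URL stays short and the same data travels in the request body
post_url = base
post_body = query.encode("ascii")
print(len(post_url), len(post_body))
```

    The POST variant keeps the URL at a fixed, short length no matter how many IDs are sent, which is the reason EPost is recommended for long lists.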

    To learn how to use QueryKey and WebEnv, read chapter 9.15 of the tutorial, "Searching for and downloading sequences using the history".
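
    As a small illustration of what QueryKey and WebEnv look like in the returned XML, the sketch below parses a made-up EPost-style response with the standard library. The sample document and its WebEnv value are placeholders, not real server output; with Biopython you would normally let Entrez.read do this parsing for you.

```python
import xml.etree.ElementTree as ET

# Made-up response in the shape of an EPost result; the WebEnv value is a placeholder
sample = """<?xml version="1.0"?>
<ePostResult>
    <QueryKey>1</QueryKey>
    <WebEnv>EXAMPLE_WEBENV_STRING</WebEnv>
</ePostResult>"""

root = ET.fromstring(sample)
query_key = root.findtext("QueryKey")
webenv = root.findtext("WebEnv")

# These two values are what a later efetch call takes as query_key= and webenv=
print(query_key, webenv)
```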

    A full working example would then be:

    from Bio import Entrez
    import time

    try:
        from urllib.error import HTTPError  # for Python 3
    except ImportError:
        from urllib2 import HTTPError  # for Python 2

    Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are

    # First search: only used to learn how many records match
    handle = Entrez.esearch(db="pubmed", term="lung+cancer")
    record = Entrez.read(handle)
    handle.close()

    # Second search: ask for all matching IDs
    total = int(record['Count'])
    handle = Entrez.esearch(db="pubmed", term="lung+cancer", retmax=total)
    record = Entrez.read(handle)
    handle.close()

    # Upload the IDs to the history server
    id_list = record['IdList']
    count = len(id_list)  # the server may return fewer IDs than 'Count'
    post_xml = Entrez.epost("pubmed", id=",".join(id_list))
    search_results = Entrez.read(post_xml)

    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]

    batch_size = 200
    with open("lung_cancer.txt", "w") as out_handle:
        for start in range(0, count, batch_size):
            end = min(count, start+batch_size)
            print("Going to download record %i to %i" % (start+1, end))
            attempt = 0
            success = False
            while attempt < 3 and not success:
                attempt += 1
                try:
                    # Request MEDLINE plain text so the data can be written to a text file
                    fetch_handle = Entrez.efetch(db="pubmed",
                                                 rettype="medline", retmode="text",
                                                 retstart=start, retmax=batch_size,
                                                 webenv=webenv, query_key=query_key)
                    success = True
                except HTTPError as err:
                    if 500 <= err.code <= 599:
                        print("Received error from server %s" % err)
                        print("Attempt %i of 3" % attempt)
                        time.sleep(15)
                    else:
                        raise
            if not success:
                # Do not continue with a missing or stale handle
                raise RuntimeError("Giving up on records %i to %i" % (start+1, end))
            data = fetch_handle.read()
            fetch_handle.close()
            out_handle.write(data)
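
    The retry logic in the loop above can also be pulled out into a small helper, which makes it easier to test without touching the network. This is only a sketch: fetch_with_retry and its parameters are hypothetical names, not part of Biopython, and the demo uses a fake fetch function that fails twice before succeeding.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=15,
                     is_retryable=lambda err: True, sleep=time.sleep):
    """Call fetch() until it succeeds or `attempts` tries are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as err:
            if not is_retryable(err):
                raise  # not a transient failure, so do not retry
            last_error = err
            if attempt < attempts:
                sleep(delay)  # wait before the next try
    raise RuntimeError("gave up after %i attempts" % attempts) from last_error


# Demo with a fake fetch function: fails twice, then returns data
calls = []

def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("simulated server error")
    return "payload"

# Pass a no-op sleep so the demo does not actually wait
result = fetch_with_retry(flaky, sleep=lambda seconds: None)
print(result, len(calls))  # payload 3
```

    With Entrez, fetch could be a lambda wrapping the Entrez.efetch call, and is_retryable could be lambda err: 500 <= getattr(err, "code", 0) <= 599 so that only server-side HTTP errors are retried.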