ruby, bioinformatics, text-mining, biopython, pubmed

How to download full article text from PubMed?


I am working on a project that requires working with the GENIA corpus. According to the literature, the GENIA corpus is built from articles retrieved by searching Medline/PubMed for three MeSH terms: “transcription factor”, “blood cell” and “human”. I want to extract the full text (where it is freely available) of the articles in the GENIA corpus from PubMed. I have tried many approaches, but I have not found a way to download the full text in text, XML or PDF format.

Using the Entrez utilities provided by NCBI:

  1. I have tried using the approach mentioned here - http://www.hpa-bioinformatics.org.uk/bioruby-api/classes/Bio/NCBI/REST/EFetch/Methods.html#M002197

    which uses the Ruby gem Bio like this to get the information for a given PubMed ID - Bio::NCBI::REST::EFetch.pubmed(15496913)

    But, it doesn't return the full text for the PMID.

  2. Internally, it makes a call like this - http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=1372388&retmode=text&rettype=medline

    But, both the Ruby gem and the above call don't return the full text.

  3. On further Internet search, I found that the allowed values for PubMed for rettype and retmode don't have an option to get the full text, as mentioned in the table here - http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly

  4. All the examples and other scripts I have seen on the Internet are only about extracting abstracts, authors, etc., and none of them discuss extracting the full text.

  5. Here is another link that I found that uses the Python package Bio (Biopython), but it only accesses the information about authors - https://www.biostars.org/p/172296/

How can I download the full text of an article in text, XML or PDF format using the Entrez utilities provided by NCBI? Or are there existing scripts or web crawlers that I can use?


Solution

  • You can use Biopython to find the articles that are in PubMed Central (PMC) and then get the PDF from there. For articles that are hosted somewhere else, it is difficult to find a generic way to get the PDF.

    It seems that PubMed Central does not want you to download articles in bulk. Requests via urllib are blocked, but the same URL works from a browser.

    from Bio import Entrez
    
    Entrez.email = "you@example.com"  # placeholder: replace with your own address
    
    
    # id is a comma-separated string of PubMed IDs;
    # not every PMID has a public PMC article
    handle = Entrez.efetch(db="pubmed", id="19304878,19088134", retmode="xml")
    
    records = Entrez.parse(handle)
    # check every record for a PMC identifier and
    # print the URL for downloading the PDF
    for record in records:
        citation = record.get('MedlineCitation', {})
        for other_id in citation.get('OtherID', []):
            # OtherID can hold non-PMC identifiers too, so filter on the prefix
            if other_id.upper().startswith('PMC'):
                print('https://www.ncbi.nlm.nih.gov/pmc/articles/%s/pdf/' % other_id.upper())
    handle.close()
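
If XML is acceptable instead of PDF, the PMC identifiers found above can also be passed to EFetch with `db=pmc`, which, as far as I know, returns the article XML (including the body for open-access articles; for others the body may be missing). Below is a minimal sketch under those assumptions. The helper names `pmc_efetch_url` and `fetch_pmc_xml`, the User-Agent string and the ID `PMC1234567` are made up for illustration and are not part of Biopython or NCBI's tools. Setting a User-Agent is only a guess at why plain urllib requests get blocked, and it may not be enough.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def pmc_efetch_url(pmcid, email):
    """Build an EFetch URL that asks the PMC database for one record as XML.

    `pmcid` may be "PMC1234567" or "1234567". The "PMC" prefix is stripped
    because the documented db=pmc examples use the bare number.
    """
    pmcid = str(pmcid).strip().upper()
    if pmcid.startswith("PMC"):
        pmcid = pmcid[3:]
    params = {"db": "pmc", "id": pmcid, "retmode": "xml", "email": email}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_pmc_xml(pmcid, email, timeout=30):
    """Download the XML for one PMC article (hypothetical helper, needs network)."""
    request = Request(
        pmc_efetch_url(pmcid, email),
        # Assumption: an explicit User-Agent may avoid the default urllib one
        # being rejected; this is not guaranteed.
        headers={"User-Agent": "pmc-fulltext-sketch/0.1"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_pmc_xml("PMC1234567", "you@example.com")` with a real PMC ID should return the XML as a string. Keep the request rate low, since NCBI limits how often E-utilities may be called.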