BioPython KeyError


I am an MPH student in an introductory data science class and have beginner knowledge of programming. I am running Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32 and using PyCharm as my IDE. I am building a web scraper using BioPython and then saving the results in a dataframe.
The code for the scraping is this:

from Bio import Entrez
import pandas

# gives a list of Citation IDs in response to a search word
def search(query):
    Entrez.email = 't@gmail.com'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='15',
                            retmode='xml',
                            datetype = 'pdat',
                            mindate = '2001/01/01',
                            maxdate = '2010/01/01',
                            term=(query)
                            )
    results = Entrez.read(handle)
    return results

# Fetch the details for all the retrieved articles via the fetch utility.
def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 't@gmail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

if __name__ == '__main__':
    results = search('fever')
    id_list = results['IdList']
    papers = fetch_details(id_list)

And then, to save the results to a dataframe, I have this:

pmid = []
title = []
pubyear = []
abstract = []

for i, paper in enumerate(papers['PubmedArticle']):
    pm = paper['MedlineCitation']['PMID']
    pmid.append(str(pm))
    tit = paper['MedlineCitation']['Article']['ArticleTitle']
    title.append(tit)
    pbyr = paper['MedlineCitation']['Article']['Journal']['JournalIssue']['PubDate']['Year']
    pubyear.append(pbyr)
    ab = paper['MedlineCitation']['Article']['Abstract']['AbstractText']
    abstract.append(str(ab))

# create empty dataframe
paper_df = pandas.DataFrame()

# add the PMID, Title, Publication Year, and Abstract columns
paper_df['Article_PMID'] = pmid
paper_df['Article_Title'] = title
paper_df['Publication_Year'] = pubyear
paper_df['Article_Abstract'] = abstract

My question is this: when the retmax argument in the esearch call is 15, everything works fine. I get 15 records, with all 4 pieces of information I need filled in. However, when I change it to 16, I get an error.

Traceback (most recent call last):
  File "C:/Users/lztp/Documents/UT/1_PHM_2193_Intro_to_Data_Science/PyCharm_Projects/FP_Crawler_Module_1.py", line 69, in <module>
    pbyr = paper['MedlineCitation']['Article']['Journal']['JournalIssue']['PubDate']['Year']
KeyError: 'Year'

My understanding is that this means 'Year' doesn't exist in the 16th record. How can I have it skip over records with missing values and only save the ones that have the values I need? I tried using a filter in the term argument in esearch, but got another error. Is there a way to check whether the value is empty or not? Any ideas on how to go about this would be greatly appreciated.


Solution

for i, paper in enumerate(papers['PubmedArticle']):
    try:
        pm = paper['MedlineCitation']['PMID']
        tit = paper['MedlineCitation']['Article']['ArticleTitle']
        pbyr = paper['MedlineCitation']['Article']['Journal']['JournalIssue']['PubDate']['Year']
        ab = paper['MedlineCitation']['Article']['Abstract']['AbstractText']
    except KeyError:
        continue
    pmid.append(str(pm))
    title.append(tit)
    pubyear.append(pbyr)
    abstract.append(str(ab))

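Just wrap the lookups in try/except KeyError. A record that is missing any of the four fields raises KeyError before anything is appended, so it is skipped and the four lists stay the same length.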
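
If you would rather check for missing values than catch the exception, here is a minimal sketch of one way to do it. It assumes the parsed records behave like nested dicts, as they do in the question's code. The helpers get_nested and extract_row and the sample records are made up for illustration; they are not part of BioPython. Some PubMed records carry a free-text 'MedlineDate' (for example '2000 Nov-Dec') in place of 'Year'; taking its first four characters as the year is an assumption that will not hold for every record.

```python
def get_nested(record, *keys, default=None):
    """Walk nested dicts key by key; return default if any step is missing."""
    current = record
    for key in keys:
        try:
            current = current[key]
        except (KeyError, TypeError, IndexError):
            return default
    return current


def extract_row(paper):
    """Return the four fields as a dict, or None if any of them is missing."""
    citation = paper.get('MedlineCitation', {})
    pmid = get_nested(citation, 'PMID')
    title = get_nested(citation, 'Article', 'ArticleTitle')
    abstract = get_nested(citation, 'Article', 'Abstract', 'AbstractText')
    pub_date = get_nested(citation, 'Article', 'Journal', 'JournalIssue',
                          'PubDate', default={})
    # Prefer 'Year'; otherwise fall back to the first four characters of
    # 'MedlineDate' (assumption: the free-text date starts with the year).
    year = pub_date.get('Year') or str(pub_date.get('MedlineDate', ''))[:4] or None

    if pmid is None or title is None or abstract is None or year is None:
        return None
    return {
        'Article_PMID': str(pmid),
        'Article_Title': title,
        'Publication_Year': str(year),
        'Article_Abstract': str(abstract),
    }


# Hypothetical sample records shaped like papers['PubmedArticle'] entries.
sample_papers = [
    {'MedlineCitation': {'PMID': '1', 'Article': {
        'ArticleTitle': 'Paper with a Year',
        'Journal': {'JournalIssue': {'PubDate': {'Year': '2005'}}},
        'Abstract': {'AbstractText': ['Some text.']}}}},
    {'MedlineCitation': {'PMID': '2', 'Article': {
        'ArticleTitle': 'Paper with only a MedlineDate',
        'Journal': {'JournalIssue': {'PubDate': {'MedlineDate': '2000 Nov-Dec'}}},
        'Abstract': {'AbstractText': ['More text.']}}}},
    {'MedlineCitation': {'PMID': '3', 'Article': {
        'ArticleTitle': 'Paper without an abstract',
        'Journal': {'JournalIssue': {'PubDate': {'Year': '2008'}}}}}},
]

# Keep only the records where every field was found.
rows = [row for row in map(extract_row, sample_papers) if row is not None]
```

With the real data you would loop over papers['PubmedArticle'] instead of sample_papers. The resulting list of dicts can be passed straight to pandas.DataFrame(rows), which replaces the four separate lists.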