I am an MPH student in an intro to data science class and have beginner knowledge of programming. I am running Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32 and using PyCharm as my IDE. I am building a web scraper using Biopython and then saving the results in a pandas DataFrame.
The code for the scraping is this:
from Bio import Entrez
import pandas


# Returns the search results, including a list of citation IDs, for a search word
def search(query):
    Entrez.email = 't@gmail.com'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='15',
                            retmode='xml',
                            datetype='pdat',  # E-utilities names this parameter "datetype"
                            mindate='2001/01/01',
                            maxdate='2010/01/01',
                            term=query)
    results = Entrez.read(handle)
    return results


# Fetch the details for all the retrieved articles via the fetch utility.
def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 't@gmail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results


if __name__ == '__main__':
    results = search('fever')
    id_list = results['IdList']
    papers = fetch_details(id_list)
And then to save to a dataframe, I have this:
pmid = []
title = []
pubyear = []
abstract = []

for i, paper in enumerate(papers['PubmedArticle']):
    pm = paper['MedlineCitation']['PMID']
    pmid.append(str(pm))
    tit = paper['MedlineCitation']['Article']['ArticleTitle']
    title.append(tit)
    pbyr = paper['MedlineCitation']['Article']['Journal']['JournalIssue']['PubDate']['Year']
    pubyear.append(pbyr)
    ab = paper['MedlineCitation']['Article']['Abstract']['AbstractText']
    abstract.append(str(ab))

# create empty dataframe
paper_df = pandas.DataFrame()

# add the PMID, Title, Publication Year, and Abstract columns
paper_df['Article_PMID'] = pmid
paper_df['Article_Title'] = title
paper_df['Publication_Year'] = pubyear
paper_df['Article_Abstract'] = abstract
My question is this: when the retmax argument in the esearch function is only 15, it works just fine. I get 15 records, with all 4 pieces of information I need filled in. However, when I change it to 16, I get an error:
Traceback (most recent call last):
  File "C:/Users/lztp/Documents/UT/1_PHM_2193_Intro_to_Data_Science/PyCharm_Projects/FP_Crawler_Module_1.py", line 69, in <module>
    pbyr = paper['MedlineCitation']['Article']['Journal']['JournalIssue']['PubDate']['Year']
KeyError: 'Year'
My understanding is that it means 'Year' doesn't exist in the next record? How can I have it skip over records with missing values and only save the ones that have the values that I need? I tried using a filter in the term argument in esearch, but got another error. Is there a way to check if the value is empty or not? Or if anyone has ideas on how to go about doing this, it will be greatly appreciated.
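for i, paper in enumerate(papers['PubmedArticle']):
    try:
        pm = paper['MedlineCitation']['PMID']
        tit = paper['MedlineCitation']['Article']['ArticleTitle']
        pbyr = paper['MedlineCitation']['Article']['Journal']['JournalIssue']['PubDate']['Year']
        ab = paper['MedlineCitation']['Article']['Abstract']['AbstractText']
    except KeyError:
        # a required field is missing, so skip this record entirely
        continue
    pmid.append(str(pm))
    title.append(tit)
    pubyear.append(pbyr)
    abstract.append(str(ab))
Just use try/except to handle the missing keys. Do all four lookups inside the try block and append only after they have all succeeded, so a record that raises KeyError is skipped and the four lists stay the same length.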
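If you would rather check whether a value is present than catch the exception, a small lookup helper is one option. The sketch below is only an illustration under some assumptions: `dig`, `extract_rows`, and `skip_incomplete` are made-up names, and `sample_articles` is hand-written stand-in data shaped like `papers['PubmedArticle']`, not real PubMed output. The second sample record stores its date under `MedlineDate` instead of `Year`, which is one way a PubMed record can end up without a `Year` key.

```python
def dig(record, *keys, default=None):
    """Follow a chain of keys through nested dicts/lists.

    Returns `default` as soon as one step of the chain is missing.
    (Hypothetical helper, not part of Biopython.)
    """
    current = record
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


def extract_rows(articles, skip_incomplete=True):
    """Build one dict per article; optionally drop rows with a missing field."""
    rows = []
    for paper in articles:
        row = {
            'Article_PMID': dig(paper, 'MedlineCitation', 'PMID'),
            'Article_Title': dig(paper, 'MedlineCitation', 'Article', 'ArticleTitle'),
            'Publication_Year': dig(paper, 'MedlineCitation', 'Article', 'Journal',
                                    'JournalIssue', 'PubDate', 'Year'),
            'Article_Abstract': dig(paper, 'MedlineCitation', 'Article',
                                    'Abstract', 'AbstractText'),
        }
        # A None value means that field was absent in this record.
        if skip_incomplete and any(value is None for value in row.values()):
            continue
        rows.append(row)
    return rows


# Hand-written stand-in data (an assumption for this sketch, not real records).
sample_articles = [
    {'MedlineCitation': {
        'PMID': '111',
        'Article': {
            'ArticleTitle': 'Complete record',
            'Journal': {'JournalIssue': {'PubDate': {'Year': '2005'}}},
            'Abstract': {'AbstractText': ['Some abstract text.']},
        }}},
    {'MedlineCitation': {
        'PMID': '222',
        'Article': {
            'ArticleTitle': 'No Year key',
            'Journal': {'JournalIssue': {'PubDate': {'MedlineDate': '2004 Nov-Dec'}}},
            'Abstract': {'AbstractText': ['Another abstract.']},
        }}},
    {'MedlineCitation': {
        'PMID': '333',
        'Article': {
            'ArticleTitle': 'No abstract',
            'Journal': {'JournalIssue': {'PubDate': {'Year': '2008'}}},
        }}},
]

complete_rows = extract_rows(sample_articles)                   # incomplete records dropped
all_rows = extract_rows(sample_articles, skip_incomplete=False)  # gaps kept as None
print(len(complete_rows), len(all_rows))
```

Passing `complete_rows` to `pandas.DataFrame(...)` should give the same four columns as in the question. Keeping the gaps as `None` with `skip_incomplete=False` lets you see how many records were incomplete before deciding to drop them.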