python, error-handling, web-scraping, data-extraction

Handle KeyError whilst scraping


I am currently working on a script to scrape data from ClinicalTrials.gov. To do this, I have written the following function:

import requests
from bs4 import BeautifulSoup

def clinicalTrialsGov(id):
    # Fetch the study record as XML
    url = "https://clinicaltrials.gov/ct2/show/" + id + "?displayxml=true"
    data = BeautifulSoup(requests.get(url).text, "lxml")
    studyType = data.study_type.text
    # These elements only exist for interventional studies
    if studyType == 'Interventional':
        allocation = data.allocation.text
        interventionModel = data.intervention_model.text
        primaryPurpose = data.primary_purpose.text
        masking = data.masking.text
        enrollment = data.enrollment.text
    officialTitle = data.official_title.text
    condition = data.condition.text
    minAge = data.eligibility.minimum_age.text
    maxAge = data.eligibility.maximum_age.text
    gender = data.eligibility.gender.text
    healthyVolunteers = data.eligibility.healthy_volunteers.text
    armType = []
    intType = []
    # A study can have several interventions and arm groups
    for each in data.findAll('intervention'):
        intType.append(each.intervention_type.text)
    for each in data.findAll('arm_group'):
        armType.append(each.arm_group_type.text)
    citedPMID = data.results_reference.pmid.text
    print(citedPMID)
    return officialTitle, studyType, allocation, interventionModel, primaryPurpose, masking, enrollment, condition, minAge, maxAge, gender, healthyVolunteers, armType, intType

However, this function won't always work, as not all studies contain every element (i.e. a KeyError will occur for missing ones). To resolve this, I could simply wrap each statement in a try-except block, like this:

try:
    studyType = data.study_type.text
except:
    studyType = ""

but that seems like a bad way to implement it. What's a better/cleaner solution?


Solution

  • This is a good question. Before I address it, let me say that you should consider changing the second argument to the BeautifulSoup (BS) constructor from "lxml" to "xml". Otherwise, BS does not flag the parsed markup as XML (to verify this for yourself, access the is_xml attribute on the data variable in your code).
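
    As a quick check, something like this sketch should show the difference between the two parsers (the xml_text name and the placeholder study ID are mine, for illustration only):

    import requests
    from bs4 import BeautifulSoup

    # Placeholder NCT ID; substitute any real study identifier
    url = "https://clinicaltrials.gov/ct2/show/NCT00000000?displayxml=true"
    xml_text = requests.get(url).text

    print(BeautifulSoup(xml_text, "lxml").is_xml)  # False: parsed as HTML
    print(BeautifulSoup(xml_text, "xml").is_xml)   # True: parsed as XML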

    You can avoid generating an error when attempting to access a non-existent element by passing a list of desired element names to the find_all() method:

    subset = ['results_reference', 'allocation', 'intervention_model',
              'primary_purpose', 'masking', 'enrollment', 'eligibility',
              'official_title', 'arm_group', 'condition']

    tag_matches = data.find_all(subset)
    

    Then, if you want to get a specific element from the list of Tags without scanning the list each time, you can convert it to a dict keyed by tag name:

    tag_dict = {tag.name: tag for tag in tag_matches}
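
    From there, missing elements can be handled with dict.get() instead of try/except. Here is a minimal sketch of that idea (the get_text helper and its default value are my own illustration, not part of the question's code):

    def get_text(tag_dict, name, default=""):
        # Return the element's text if it was found, otherwise a default
        tag = tag_dict.get(name)
        return tag.get_text() if tag is not None else default

    officialTitle = get_text(tag_dict, 'official_title')
    enrollment = get_text(tag_dict, 'enrollment')  # "" if the study has no enrollment element

    One caveat: if a study contains several elements with the same name (e.g. multiple arm_group or condition elements), the dict keeps only the last match, so loops like the question's findAll('intervention') remain the right tool for collecting repeated elements.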