Search code examples
pythonpandasxmlbeautifulsouptext-parsing

Python / BeautifulSoup - Extracting XML data from clinicaltrials.gov, only able to extract studies that don't have missing data


I am using clinicaltrials.gov's API to obtain a list of clinical trial data into a XML file and then I am parsing the data to eventually export into an Excel dataset.

In the URL provided in my code, there are 9 results, however my code is only pulling data for 5/9. I've realized it's because that for one of the fields (detaileddescription), only some of the trials have this data. When I remove detaileddescription and just use the two other fields (nctid and briefdescription), I am able to get 9/9. What can I do here besides doing something messy like creating a separate dataframe for detaileddescription and merging?

Bottom line: I am extracting 3 fields from an XML file that includes 9 clinical trials: nctid, briefsummary, and detaileddescription, but my output is only extracting 5/9 clinical trials. How can my output get all 9/9 without taking out the detaileddescription field from my output?

import requests
from bs4 import BeautifulSoup
import pandas as pd

out = []
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=50&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
nctids = soup.find_all("field", {"name" : "NCTId"})
briefsummaries = soup.find_all("field", {"name" : "BriefSummary"}) if soup.find_all("field", {"name" : "BriefSummary"}) is not None else 'nothing'
detaileddescriptions = soup.find_all("field", {"name" : "DetailedDescription"}) if soup.find_all("field", {"name" : "DetailedDescription"}) is not None else 'nothing'

for nctid, briefsummary, detaileddescription in zip(nctids, briefsummaries, detaileddescriptions):
    
    data = {'nctid': nctid, 'briefsummary': briefsummary, 'detaileddescription': detaileddescription}
    out.append(data)
df = pd.DataFrame(out)

df.to_excel('clinicaltrialstresults.xlsx')

Solution

  • You can try looping over the study list with slight changes to your code

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    
    out = []
    url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=50&fmt=xml'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    study_list = soup.find_all("fullstudy")
    
    for study in study_list:
        nctid = study.find("field", {"name" : "NCTId"})
        briefsummary = study.find("field", {"name" : "BriefSummary"}) if study.find("field", {"name" : "BriefSummary"}) is not None else 'nothing'
        detaileddescription = study.find("field", {"name" : "DetailedDescription"}) if study.find("field", {"name" : "DetailedDescription"}) is not None else 'nothing'
        data = {'nctid': nctid, 'briefsummary': briefsummary, 'detaileddescription': detaileddescription}
        out.append(data)
    
    df = pd.DataFrame(out)
    df.to_excel('clinicaltrialstresults.xlsx', index=False)