Tags: python, web-scraping, beautifulsoup, html-parsing, export-to-csv

Python web scraping with BeautifulSoup - can't extract Principal Investigator from ClinicalTrials.gov


(Disclaimer: I'm a Python and web scraping noob, but I'm doing my best to learn).

I'm trying to extract 3 key data points from research studies on clinicaltrials.gov. They have an API, but the API doesn't capture the things I need. I want to get (1) a short description of the study, (2) the Principal Investigator (PI), and (3) some keywords associated with the study. I believe my code captures 1 and 3, but not 2, and I can't figure out why I'm not getting the Principal Investigator's name. Here are the two pages I have in my code:

https://clinicaltrials.gov/ct2/show/NCT03530579
https://clinicaltrials.gov/ct2/show/NCT03436992

Here's my code (I know the PI code is wrong, but I wanted to demonstrate that I tried):

import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv   

fields=['PI','Project_Summary', 'Keywords']
with open(r'test.csv', 'a') as f:
     writer = csv.writer(f)
     writer.writerow(fields)

urls = ['https://clinicaltrials.gov/ct2/show/NCT03436992','https://clinicaltrials.gov/ct2/show/NCT03530579']
for url in urls:

     response = requests.get(url)
     soup = BeautifulSoup(response.content, 'html.parser')
     #get_keywords
     for rows in soup.find_all("td"):
          k = rows.get_text()     
          Keywords = k.strip()
     #get Principal Investigator   
     PI = soup.find_all('padding:1ex 1em 0px 0px;white-space:nowrap;')

     #Get description    
     Description = soup.find(class_='ct-body3 tr-indent2').get_text()
     d = {'Summary2':[PI,Description,Keywords]} 

     df = pd.DataFrame(d)
     print (df)
     import csv   
     fields=[PI,Description, Keywords]
     with open(r'test.csv', 'a') as f:
          writer = csv.writer(f)
          writer.writerow(fields)

Solution

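  • One option is to pass a CSS selector to select_one:

    PI = soup.select_one('.tr-table_cover [headers=name]').text

    select_one returns None when nothing matches, so the loop below falls back to 'No PI' in that case.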

    import requests
    from bs4 import BeautifulSoup  
    urls = ['https://clinicaltrials.gov/ct2/show/NCT03530579', 'https://clinicaltrials.gov/ct2/show/NCT03436992','https://clinicaltrials.gov/show/NCT03834376']
    with requests.Session() as s:
        for url in urls:
            r = s.get(url)
            soup = BeautifulSoup(r.text, "lxml")
            # Run the selector once; select_one returns None when nothing matches
            node = soup.select_one('.tr-table_cover [headers=name]')
            item = node.text if node is not None else 'No PI'
            print(item)
    

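    The . prefix is a class selector and [] is an attribute selector, so the pattern matches an element whose headers attribute is name. The space between them is a descendant combinator: the element matched on the right must be nested, at any depth, inside the element matched on the left (a direct child would be written with > instead).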
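
To see the selector's behaviour without hitting the live site, here is a minimal sketch that runs against a small inline HTML fragment. The markup is a made-up stand-in for the investigator table (the real page structure may differ or change over time), and extract_pi is a hypothetical helper name, not part of BeautifulSoup. It uses the built-in html.parser backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the investigator table; not copied from the site.
SAMPLE_HTML = """
<div class="tr-table_cover">
  <table>
    <tr>
      <td headers="role">Principal Investigator:</td>
      <td headers="name">Jane Doe, MD</td>
    </tr>
  </table>
</div>
<table>
  <tr><td headers="name">Outside the wrapper</td></tr>
</table>
"""


def extract_pi(html, default="No PI"):
    """Return the text of the first [headers=name] element nested at any
    depth inside an element with class tr-table_cover, else `default`."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(".tr-table_cover [headers=name]")
    return node.get_text(strip=True) if node is not None else default


print(extract_pi(SAMPLE_HTML))             # cell inside the wrapper
print(extract_pi("<p>no table here</p>"))  # nothing matches, so the default
```

The matching cell sits several levels below the wrapper div, which is why the descendant combinator finds it. The second table has a headers=name cell as well, but it is not inside .tr-table_cover, so the selector skips it.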