(Disclaimer: I'm a Python and web scraping noob, but I'm doing my best to learn).
I'm trying to extract 3 key data points from research studies on clinicaltrials.gov. They have an API, but the API doesn't capture the things I need. I want to get a (1) short description of the study, (2) the Principal Investigator (PI), and (3) some keywords associated with the study. I believe my code captures 1 and 3, but not 2. I can't seem to figure out why I'm not getting the Principal Investigator(s) name. Here are the two sites I have in my code:
https://clinicaltrials.gov/ct2/show/NCT03530579
https://clinicaltrials.gov/ct2/show/NCT03436992
Here's my code (I know the PI code is wrong, but I wanted to demonstrate that I tried):
import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv

fields=['PI','Project_Summary', 'Keywords']
with open(r'test.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow(fields)

urls = ['https://clinicaltrials.gov/ct2/show/NCT03436992','https://clinicaltrials.gov/ct2/show/NCT03530579']
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    #get_keywords
    for rows in soup.find_all("td"):
        k = rows.get_text()
        Keywords = k.strip()

    #get Principal Investigator
    PI = soup.find_all('padding:1ex 1em 0px 0px;white-space:nowrap;')

    #Get description
    Description = soup.find(class_='ct-body3 tr-indent2').get_text()

    d = {'Summary2':[PI,Description,Keywords]}
    df = pd.DataFrame(d)
    print (df)

    import csv
    fields=[PI,Description, Keywords]
    with open(r'test.csv', 'a') as f:
        writer = csv.writer(f)
        writer.writerow(fields)
You may be able to use the selector .tr-table_cover [headers=name], i.e.
PI = soup.select_one('.tr-table_cover [headers=name]').text
import requests
from bs4 import BeautifulSoup

urls = ['https://clinicaltrials.gov/ct2/show/NCT03530579', 'https://clinicaltrials.gov/ct2/show/NCT03436992','https://clinicaltrials.gov/show/NCT03834376']

with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = BeautifulSoup(r.text, "lxml")
        # select_one returns None when nothing matches, so look the cell up once
        cell = soup.select_one('.tr-table_cover [headers=name]')
        item = cell.text if cell is not None else 'No PI'
        print(item)
The . is a class selector and the [] is an attribute selector. The space between them is a descendant combinator, which specifies that the element matched on the right is a descendant (not necessarily a direct child) of the one matched on the left.
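
To see how the pieces of the selector behave without hitting the site, here is a minimal sketch run against a small inline HTML string. The markup, the name "Jane Doe, MD", and the helper extract_pi are all made up for illustration; the real page is larger and may be structured differently. It also shows why the find_all call in the question comes back empty: find_all treats its first positional argument as a tag name, so a style string matches nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the selector above;
# the real clinicaltrials.gov page is larger and may differ.
HTML = """
<div class="tr-table_cover">
  <table>
    <tr>
      <td headers="role">Principal Investigator:</td>
      <td headers="name" style="padding:1ex 1em 0px 0px;white-space:nowrap;">Jane Doe, MD</td>
    </tr>
  </table>
</div>
<table>
  <tr><td headers="name">Outside the cover div</td></tr>
</table>
"""


def extract_pi(html, default="No PI"):
    """Hypothetical helper: return the PI cell text, or a default if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # class selector + descendant combinator + attribute selector
    cell = soup.select_one(".tr-table_cover [headers=name]")
    return cell.get_text(strip=True) if cell is not None else default


soup = BeautifulSoup(HTML, "html.parser")

# find_all() reads its first positional argument as a tag name, so passing
# a style string finds no tags and an empty result comes back.
by_style = soup.find_all("padding:1ex 1em 0px 0px;white-space:nowrap;")

# Without the class prefix, the attribute selector alone matches both cells.
all_name_cells = soup.select("[headers=name]")

print(len(by_style))
print(extract_pi(HTML))
print(extract_pi("<p>no table here</p>"))
print(len(all_name_cells))
```

The .tr-table_cover prefix is what narrows the match to the cell inside that container; dropping it would also pick up any other element carrying headers="name" elsewhere on the page.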