My code accesses a page where each row may or may not have a drop-down containing more information.
I use a try/except statement to check for this.
It works fine for line 1, but not for line 2. Why?
import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
gg=[]
r = requests.get('https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=2')
soup = bs(r.text, 'lxml')
sessions = soup.select('#accordin > ul > li')
for session in sessions:
    jj=(session.select_one('h4').text)
    print(jj)
    sub_session = session.select('.sub_accordin_presentation')
    try:
        if sub_session:
            kk=([re.sub(r'[\n\s]+', ' ', i.text) for i in sub_session])
            print(kk)
    except:
        kk=' '
    dict={"Title":jj,"Sub":kk}
    gg.append(dict)
df=pd.DataFrame(gg)
df.to_csv('test2.csv')
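The `try`/`except` in the code above never reaches the `except` branch for a session without a drop-down: `select()` returns an empty list when nothing matches, it does not raise. The `if` is then simply skipped, so `kk` is either undefined or still holds the previous session's value. A minimal sketch of the behaviour, using a small hand-written HTML snippet as a stand-in for the real page (the markup, session titles, and the `rows`/`sub_titles` names are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page.
html = """
<div id="accordin"><ul>
  <li><h4>Session A</h4></li>
  <li><h4>Session B</h4>
    <div class="sub_accordin_presentation"><h4>B.01 - Talk one</h4></div>
    <div class="sub_accordin_presentation"><h4>B.02 - Talk two</h4></div>
  </li>
</ul></div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for session in soup.select("#accordin > ul > li"):
    title = session.h4.get_text(strip=True)
    subs = session.select(".sub_accordin_presentation")
    # select() returns [] when nothing matches, so use if/else, not try/except.
    if subs:
        sub_titles = [s.h4.get_text(strip=True) for s in subs]
    else:
        sub_titles = []  # reset explicitly for sessions without a drop-down
    rows.append({"Title": title, "Sub": sub_titles})

print(rows)
```

The explicit `else` branch is what guarantees that every session gets its own value instead of inheriting one from an earlier iteration.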
To get all sections + sub-sections, try:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get(
    "https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=2"
)
soup = bs(r.text, "lxml")
sessions = soup.select("#accordin > ul > li")
gg = []
for session in sessions:
    jj = session.h4.get_text(strip=True, separator=" ")
    sub_sessions = session.select(".sub_accordin_presentation")
    if sub_sessions:
        # one row per sub-session, repeating the session title
        for sub_session in sub_sessions:
            gg.append(
                {
                    "Title": jj,
                    "Sub": sub_session.h4.get_text(strip=True, separator=" "),
                }
            )
    else:
        # select() returned an empty list: the session has no drop-down
        gg.append(
            {
                "Title": jj,
                "Sub": "None",
            }
        )
df = pd.DataFrame(gg)
df.to_csv("data.csv", index=False)
print(df)
Prints:
Title Sub
0 IS05 - Industry Symposium Sponsored by Amgen: Advancing Lung Cancer Treatment with Novel Therapeutic Targets None
1 IS06 - Industry Symposium Sponsored by Jazz Pharmaceuticals: Exploring a Treatment Option for Patients with Previously Treated Metastatic Small Cell Lung Cancer (SCLC) None
2 IS07 - Satellite CME Symposium by Sanofi Genzyme: On the Frontline: Immunotherapeutic Approaches in Advanced NSCLC None
3 PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available) PL02A.01 - Durvalumab ± Tremelimumab + Chemotherapy as First-line Treatment for mNSCLC: Results from the Phase 3 POSEIDON Study
4 PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available) PL02A.02 - Discussant
5 PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available) PL02A.03 - Lurbinectedin/doxorubicin versus CAV or Topotecan in Relapsed SCLC Patients: Phase III Randomized ATLANTIS Trial
...
and creates data.csv
(screenshot from LibreOffice):