I am trying to scrape a table from a website with Python, but for some reason all of my usual methods have failed. There's a table at https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/ with 45 pages. I have tried to scrape it using requests, requests-html (with rendering), BeautifulSoup, and Selenium. Here is one of my attempts; I won't paste all of them, since the approaches are similar, just with different Python libraries:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
page = session.get('https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/')
page.html.render(timeout=120)
soup = BeautifulSoup(page.content, 'lxml') #also tried with page.text and 'html.parser' and all permutations
table = soup.find_all(id='table')
My table variable is an empty list here, and it shouldn't be. I've also tried Selenium to locate other elements within the table, searching by class and by XPath as well, but all of these failed to find the table or any part of it. I have scraped quite a few similar websites with these methods and never had a problem before this one. Any ideas, please?
If you inspect the page, you'll see that the result table lives inside an iframe (a Flourish embed), which is why none of your selectors matched anything in the article's own HTML. You can extract the information directly from the iframe's source URL:
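To locate the iframe yourself, parse the article page and list the src attributes of its iframes. A minimal sketch, using a trimmed inline stand-in for the article's HTML (the live page's markup may differ):

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the article page; the real page embeds the
# Flourish visualisation in an iframe much like this one.
html = """
<html><body>
  <p>535 new cases of COVID-19 reported in Ohio schools...</p>
  <iframe src="https://flo.uri.sh/visualisation/3894531/embed?auto=1"
          title="Interactive or visual content"></iframe>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
# Collect every iframe's src -- this is where the "missing" table lives
iframe_srcs = [frame['src'] for frame in soup.find_all('iframe')]
print(iframe_srcs)
```

On the live page you would fetch the article with requests first and feed its text to BeautifulSoup the same way; the point is that the table's rows are only reachable through that iframe URL.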
Here is the code, which saves the result to a .csv file:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
def get_rows(driver):
    """Scrape the currently visible page of the table.

    Returns:
        dict mapping each column name to a list of cell values
    """
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.XPATH, "//div[@class='tr body-row']"))
    )
    rows = driver.find_elements(By.XPATH, "//div[@class='tr body-row']")
    table_info = {
        'Rank': [],
        'County': [],
        'School/District': [],
        'Type': [],
        'Total cases': [],
        'Student cases': [],
        'Staff cases': []
    }
    for row in rows:
        cols = row.find_elements(By.CLASS_NAME, 'td')
        # pair each cell with its column name (dicts keep insertion order)
        for i, key in enumerate(table_info):
            table_info[key].append(cols[i].text)
    return table_info
# path to the chrome driver (raw string so the backslashes aren't treated as escapes)
driver = webdriver.Chrome(r"D:\chromedriver\94\chromedriver.exe")
driver.get("https://flo.uri.sh/visualisation/3894531/embed?auto=1")
df = pd.DataFrame.from_dict(get_rows(driver))
# the table has 45 pages: click "next" 44 times and append each page's rows
for _ in range(44):
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//button[@class="pagination-btn next"]'))
    ).click()
    df = pd.concat([df, pd.DataFrame.from_dict(get_rows(driver))])
driver.quit()
print(df)
df.to_csv('COVID-19_cases_reported_in_Ohio_schools.csv', index=False)
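As a side note, Flourish embed pages usually ship the underlying data as a JavaScript literal in the page source (commonly a variable such as _Flourish_data), so you can sometimes skip Selenium entirely: fetch the embed URL with requests and pull that object out with a regex. The variable name and structure are assumptions that vary between embed versions; here is a minimal sketch against an inline sample of what such a script tag can look like:

```python
import json
import re

# Stand-in for the embed page's HTML; real Flourish pages define the data
# in a script tag (the variable name varies by embed version -- an assumption here).
page_source = """
<script>
var _Flourish_data = {"rows": [{"Rank": "1", "County": "Franklin",
"Total cases": "123"}]};
</script>
"""

# Grab the JavaScript object assigned to _Flourish_data and parse it as JSON
match = re.search(r'_Flourish_data\s*=\s*(\{.*?\});', page_source, re.DOTALL)
data = json.loads(match.group(1))
print(data["rows"][0]["County"])
```

If the embedded literal isn't valid JSON (e.g. unquoted keys), this shortcut won't work and the Selenium approach above remains the reliable option.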