I am trying to scrape a table from a website with Python, but all the methods I know have failed. There's a table at https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/ with 45 pages. I have tried to scrape it using requests, requests-html (with rendering), BeautifulSoup, and selenium. This is one of my attempts; I won't copy all of them here, since the others are similar and only use different Python libraries:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
page = session.get('https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/')
soup = BeautifulSoup(page.content, 'lxml') #also tried with page.text and 'html.parser' and all permutations
table = soup.find_all(id='table')
My table variable is an empty list here, and it shouldn't be. I've also tried to find other elements within the table with selenium, by class and by XPath, but all of these failed to find the table or any part of it. I have scraped quite a few similar websites with these methods and never had a problem before this one. Any ideas, please?
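If you inspect the page, you'll see that the result table is inside an iframe. An iframe is a separate document, so searching the outer page's HTML will not find the table. You can extract the information directly from the source of the iframe.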
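To find out which URL the iframe loads, you can list the `src` attributes of the iframes in the page's HTML. The following is only a minimal sketch using the standard library: `IframeSrcCollector`, `find_iframe_sources`, and the sample HTML with its example.com URL are made up for illustration, and it assumes the iframe tag is present in the HTML you pass in (if the page injects it with JavaScript, you would need the rendered HTML instead).

```python
from html.parser import HTMLParser


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_iframe_sources(html):
    """Return the src URLs of all iframes found in an HTML string."""
    collector = IframeSrcCollector()
    collector.feed(html)
    return collector.sources


# Hypothetical page snippet, only for illustration; the real page differs.
sample_html = """
<html><body>
  <div id="article">Some text</div>
  <iframe src="https://example.com/embedded-table"></iframe>
</body></html>
"""

iframe_sources = find_iframe_sources(sample_html)
print(iframe_sources)
```

With the code from the question, you could pass `page.text` to `find_iframe_sources` and then open the matching URL on its own. With BeautifulSoup, `soup.find_all('iframe')` gives you the same tags.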
Here is the code that should save the result to a .csv file:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
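def get_rows(driver):
    """Return the rows of the currently displayed table page as a dict of columns."""
    # wait until at least one body row has been loaded
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@class='tr body-row']")))
    rows = driver.find_elements(By.XPATH, "//div[@class='tr body-row']")
    table_info = {
        'Rank': [],
        'Total cases': [],
        'Student cases': [],
        'Staff cases': []
    }
    for row in rows:
        cols = row.find_elements(By.CLASS_NAME, 'td')
        # assumes the cells appear in the same order as the keys of table_info
        for index, key in enumerate(table_info):
            table_info[key].append(cols[index].text)
    return table_info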
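# path to the Chrome driver (raw string, so the backslashes are not read as escape sequences)
driver = webdriver.Chrome(r"D:\chromedriver\94\chromedriver.exe")
# open the iframe's source URL with driver.get(...) before reading any rows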
df = pd.DataFrame.from_dict(get_rows(driver))
# the table has 45 pages and the first one is already read, so click "next" 44 times
for _ in range(44):
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//button[@class="pagination-btn next"]'))).click()
    df = pd.concat([df, pd.DataFrame.from_dict(get_rows(driver))])
df.to_csv('COVID-19_cases_reported_in_Ohio_schools.csv', index=False)
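As a side note, calling `pd.concat` on every page copies the whole frame each time. One possible alternative is to collect the per-page dictionaries first and build the DataFrame once at the end. This is only a sketch: `merge_pages` is a hypothetical helper, and the two sample pages are made-up data shaped like the dict returned by `get_rows`.

```python
def merge_pages(pages):
    """Combine per-page {column: [values]} dicts into one dict of lists."""
    merged = {}
    for page in pages:
        for column, values in page.items():
            # create the column on first sight, then append this page's values
            merged.setdefault(column, []).extend(values)
    return merged


# Made-up data for two pages, shaped like the dict returned by get_rows.
page_1 = {"Rank": ["1", "2"], "Total cases": ["10", "8"]}
page_2 = {"Rank": ["3"], "Total cases": ["5"]}

combined = merge_pages([page_1, page_2])
print(combined)
```

In the scraping loop you would append each `get_rows(driver)` result to a list, then call `pd.DataFrame.from_dict(merge_pages(that_list))` once before writing the .csv file.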