Tags: python, selenium, web-scraping, beautifulsoup, python-requests-html

I cannot scrape a table from a website with the usual web-scraping tools


I am trying to scrape a table from a website with Python, but for some reason all of my usual methods have failed. There's a table at https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/ with 45 pages. I have tried to scrape it using requests, requests-html (with rendering), BeautifulSoup, and Selenium. Here is one of my attempts; I won't paste all of them, since the approach is similar with each library:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
page = session.get('https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/')
page.html.render(timeout=120)
soup = BeautifulSoup(page.html.html, 'lxml') # note: page.content holds the un-rendered response; also tried page.text, page.content, and 'html.parser' in all combinations
table = soup.find_all(id='table')

My table variable is an empty list here, and it shouldn't be. I've also tried to find elements within the table with Selenium, by class and by XPath, but all of these failed to locate the table or any part of it. I've scraped quite a few similar websites with these methods and have never had a problem before this one. Any ideas, please?


Solution

  • If you inspect the page, you'll see that the result table sits inside an iframe, which is why none of the parsers find it in the top-level document. You can extract the information directly from the iframe's source:

    https://flo.uri.sh/visualisation/3894531/embed?auto=1

    Here is the code, which saves the result to a .csv file:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    import pandas as pd
    
    def get_rows(driver):
        """
        returns rows from a page
        
        Returns:
        Dict
        """
        WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@class='tr body-row']")))
        rows = driver.find_elements(By.XPATH, "//div[@class='tr body-row']")
        table_info= {
            'Rank': [],
            'County':[],
            'School/District':[],
            'Type':[],
            'Total cases':[],
            'Student cases':[],
            'Staff cases':[]
        }
        
        for row in rows:
            cols = row.find_elements(By.CLASS_NAME, 'td')
            for i, key in enumerate(table_info):
                table_info[key].append(cols[i].text)
    
        return table_info
    
    # path to the chrome driver; use a raw string so the backslashes
    # aren't treated as escapes, and a Service object as Selenium 4 expects
    from selenium.webdriver.chrome.service import Service
    driver = webdriver.Chrome(service=Service(r"D:\chromedriver\94\chromedriver.exe"))
    
    driver.get("https://flo.uri.sh/visualisation/3894531/embed?auto=1")
    
    
    df = pd.DataFrame.from_dict(get_rows(driver))
    
    for _ in range(44):
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[@class="pagination-btn next"]'))).click()
        df = pd.concat([df, pd.DataFrame.from_dict(get_rows(driver))])
    
    print(df)
    df.to_csv('COVID-19_cases_reported_in_Ohio_schools.csv', index=False)
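    The embed URL above was found by inspecting the page manually. If you'd rather locate it programmatically, a minimal sketch with BeautifulSoup could look like the following. It is shown here against an inline HTML stand-in for the article page (so it runs without a network call), and it assumes the Flourish widget is the only iframe whose `src` points at flo.uri.sh:

    ```python
    from bs4 import BeautifulSoup

    # Simplified stand-in for the article's HTML; the real page embeds
    # the Flourish visualisation in an iframe much like this one.
    html = """
    <html><body>
      <article>
        <iframe src="https://flo.uri.sh/visualisation/3894531/embed?auto=1"></iframe>
      </article>
    </body></html>
    """

    soup = BeautifulSoup(html, 'html.parser')
    # Attribute filters accept callables, so pick the iframe whose
    # src points at the Flourish embed host.
    iframe = soup.find('iframe', src=lambda s: s and 'flo.uri.sh' in s)
    embed_url = iframe['src']
    print(embed_url)  # https://flo.uri.sh/visualisation/3894531/embed?auto=1
    ```

    Against the real article you would fetch the page with requests first and pass `response.text` to BeautifulSoup instead of the inline snippet.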
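    One subtlety in `get_rows`: the inner loop relies on Python dicts preserving insertion order (guaranteed since Python 3.7), so the i-th cell of each row lines up with the i-th key of `table_info`. A self-contained sketch of that mapping, using made-up cell values and a shortened column set for illustration:

    ```python
    # Dict keys keep insertion order (Python 3.7+), so enumerate()
    # pairs each column position with the matching header.
    table_info = {'Rank': [], 'County': [], 'Total cases': []}

    rows = [
        ['1', 'Franklin', '120'],   # made-up cell values for illustration
        ['2', 'Cuyahoga', '95'],
    ]

    for cells in rows:
        for i, key in enumerate(table_info):
            table_info[key].append(cells[i])

    print(table_info['County'])  # ['Franklin', 'Cuyahoga']
    ```

    This is why the order of keys in `table_info` must match the left-to-right order of the `td` cells in the table.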