python · selenium · web-scraping · iframe

Cannot extract/load all hrefs from iframe (inside html page) while parsing Webpage


I am really struggling with this case and have been trying all day; please, I need your help. I am trying to scrape this webpage: https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or= I want to get all 137 hrefs (137 documents). The code I used:

    import time
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.keys import Keys

    def test(self):
        final_url = 'https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or='
        self.driver.get(final_url)
        # find the iframe that holds the results and open its src directly
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
        iframe = soup.find('iframe')
        src = iframe['src']
        base = 'https://decisions.scc-csc.ca/'
        main_url = urljoin(base, src)
        self.driver.get(main_url)
        # page down repeatedly to trigger the lazy loading
        elem = self.driver.find_element_by_tag_name("body")
        no_of_pagedowns = 20
        while no_of_pagedowns:
            elem.send_keys(Keys.PAGE_DOWN)
            time.sleep(0.2)
            no_of_pagedowns -= 1

The problem is that only the first 25 documents (hrefs) load, and I don't know how to get the remaining ones.


Solution

  • This code scrolls down until all the rows are visible, then saves the URLs of the PDFs in the list `pdfs`. Notice that all the work is done with Selenium, without using BeautifulSoup.

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    # your_chromedriver_path is the path to your chromedriver executable
    driver = webdriver.Chrome(service=Service(your_chromedriver_path))
    driver.get('https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or=')
    
    # wait for the iframe to be loaded and then switch to it
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "decisia-iframe")))
    
    # in this case number_of_results = 137
    number_of_results = int(driver.find_element(By.XPATH, "//h2[contains(., 'result')]").text.split()[0])
    pdfs = []
    
    while len(pdfs) < number_of_results:
        pdfs = driver.find_elements(By.CSS_SELECTOR, 'a[title="Download the PDF version"]')
        # scroll down to the last visible row
        driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', pdfs[-1])
        time.sleep(1)
    
    pdfs = [pdf.get_attribute('href') for pdf in pdfs]
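One caveat: if the page ever stops loading new rows (slow network, changed markup), `len(pdfs)` never reaches `number_of_results` and the `while` loop above spins forever. A defensive variant caps the number of scroll attempts. Here is a minimal sketch of that loop logic, with the Selenium calls replaced by a plain `fetch` callable and a simulated lazily loaded page (`fake_fetch` and `collect_until` are hypothetical names introduced only for illustration):

```python
import time

def collect_until(fetch, target, max_attempts=30, delay=0.0):
    """Call fetch() until it returns at least `target` items or
    max_attempts is exhausted; return the last batch either way."""
    items = []
    for _ in range(max_attempts):
        items = fetch()
        if len(items) >= target:
            break
        time.sleep(delay)  # give the page time to load more rows
    return items

# Simulate a page that exposes 25 more links on each "scroll", up to 137
loaded = []

def fake_fetch():
    n = min(len(loaded) + 25, 137)
    loaded[:] = [f"doc-{i}.pdf" for i in range(n)]
    return list(loaded)

links = collect_until(fake_fetch, target=137)
print(len(links))  # prints 137
```

In the real script the same helper could wrap the Selenium call, e.g. `collect_until(lambda: driver.find_elements(By.CSS_SELECTOR, 'a[title="Download the PDF version"]'), number_of_results, delay=1)`, so a stalled page ends the loop after `max_attempts` scrolls instead of hanging.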