Tags: python, selenium-webdriver, web-scraping, webdriverwait, expected-condition

How to find web element with Selenium Python while iterating through URLs


I need to loop through one million webpages and scrape one element (with the same class name on every page) from each of them. I have set up the code in the following (simplified) way:

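from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
wait = WebDriverWait(driver, 10)  # wait up to 10 seconds per lookup
detail_dict = {}
for i in range(1000000):
    url = f'http://www.cnappc.it/risultato.aspx?IDAssociato={i}&tipo=1#edit'
    driver.get(url)
    # Block until an element with class "content" is present in the DOM
    elem_detail = wait.until(expected_conditions
                             .presence_of_element_located((By.CLASS_NAME, 'content')))
    detail_dict[i] = elem_detail.text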

The code runs rather smoothly and, when I interrupt the kernel to check, I can see i and url increasing with each iteration. However, the browser window gets 'stuck' on the very first URL, i.e. http://www.cnappc.it/risultato.aspx?IDAssociato=0&tipo=1#edit, so elem_detail.text returns the same string over and over. It seems as though the browser cannot keep up with the driver.get(url) calls, despite the fact that .get() is supposed to wait for the page to load fully.

From Selenium-Python/Getting Started:

The driver.get method will navigate to a page given by the URL. WebDriver will wait until the page has fully loaded (that is, the “onload” event has fired) before returning control to your test or script.

I inserted an expected condition for elem_detail, to no avail. Adding time.sleep(2) after driver.get(url) lets the browser change pages and display different content, but it would cause a major slowdown (about 23 days of extra waiting across one million pages). Even then, the page still gets stuck from time to time, and dictionary values end up repeating at irregular intervals.

Would you be able to recommend a robust approach that does not involve time.sleep()?


FYI: I am using selenium with geckodriver.


Solution

  • I managed to solve my issue by switching to webdriver.Chrome(). With Chrome, the webdriver actually waits for each page to load, finds the element by class name, and moves on to the next page, without needing any time.sleep().
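
If switching browsers is not an option, one possible alternative is to wait explicitly for the previous page's element to be replaced instead of sleeping for a fixed time. The following is only a minimal sketch and has not been confirmed against the site in the question. The helper names element_replaced and scrape_details are hypothetical. It assumes that driver and wait are the webdriver and WebDriverWait objects from the question, and that an element located on a newly loaded page compares unequal to the one located on the previous page.

```python
def element_replaced(locator, previous):
    """Build a wait condition (hypothetical helper) that succeeds once
    `locator` matches an element that is not `previous`."""
    def condition(driver):
        matches = driver.find_elements(*locator)
        if matches and (previous is None or matches[0] != previous):
            return matches[0]
        return False  # a falsy result makes wait.until() poll again
    return condition


def scrape_details(driver, wait, ids, url_template, locator):
    """Return {id: text}, waiting on each page for a freshly located element."""
    details = {}
    previous = None
    for i in ids:
        driver.get(url_template.format(i))
        # Do not read .text until the element differs from the last page's one
        previous = wait.until(element_replaced(locator, previous))
        details[i] = previous.text
    return details
```

With the objects from the question, a call might look like scrape_details(driver, wait, range(1000000), 'http://www.cnappc.it/risultato.aspx?IDAssociato={}&tipo=1#edit', (By.CLASS_NAME, 'content')). If a page never changes, wait.until() should raise a timeout error rather than silently storing a repeated value, which at least makes the stuck pages visible.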