I need to loop through one million webpages and scrape the same element (identified by the same class name on every page) from each one. I have set up the code in the following (simplified) way:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

driver = webdriver.Firefox()
wait = WebDriverWait(driver, 10)

detail_dict = {}
for i in range(1000000):
    url = f'http://www.cnappc.it/risultato.aspx?IDAssociato={i}&tipo=1#edit'
    driver.get(url)
    elem_detail = wait.until(expected_conditions
                             .presence_of_element_located((By.CLASS_NAME, 'content')))
    detail_dict[i] = elem_detail.text
The code runs rather smoothly and, when I interrupt the kernel to check, I can see i and url increasing with each iteration. However, the driver window stays 'stuck' on the very first URL, i.e. http://www.cnappc.it/risultato.aspx?IDAssociato=0&tipo=1#edit, so elem_detail.text returns the same string over and over. It seems as though the browser cannot keep up with the driver.get(url) calls, even though .get() is supposed to wait for the page to load fully.
From Selenium-Python/Getting Started:
The driver.get method will navigate to a page given by the URL. WebDriver will wait until the page has fully loaded (that is, the “onload” event has fired) before returning control to your test or script.
I inserted an expected condition for elem_detail, to no avail. Setting a time.sleep(2) after driver.get(url) does let the driver window change and display different content, but then I would face a major slowdown. Even then, the page gets stuck from time to time, and dictionary value entries end up repeating unsystematically.
Could you recommend a robust approach that does not involve time.sleep()?
FYI: I am using selenium with geckodriver.
I managed to solve my issue by switching to webdriver.Chrome(). The webdriver actually waits for each page to load, finds the class element and moves on to the next page, without any time.sleep().