I want to get the data within the div loop, so that the values are ordered into the correct rows. Also, I want the data from the entire page, not only from the visible part of the page.
import time
import pandas as pd
from selenium.webdriver.common.by import By

# driver, actions and INTERVAL are set up earlier in the script
ruby = driver.find_elements(By.XPATH, "//div[@class='sc-kbGplQ bCRLdc']")
for i in ruby:
    # actions.move_to_element(i).perform()
    driver.execute_script("arguments[0].scrollIntoView();", i)
    time.sleep(INTERVAL)
    try:
        Firm_name = [el.text for el in i.find_elements(By.XPATH, "//div[1]/h2[@class='sc-idjmjb jDJltL']")]
        Remediation_status = [el.text for el in i.find_elements(By.XPATH, "//div[1]/span[2][@class='sc-iKpIOp iKvkEG']")]
        Safety_training = [el.text for el in i.find_elements(By.XPATH, "//div[2]/span[2][@class='sc-iKpIOp iKvkEG']")]
        Worker_number = [el.text for el in i.find_elements(By.XPATH, "//div[1]/h2[@class='sc-bsVVwV gnfeLF']")]
        Progress_rate = [el.text for el in i.find_elements(By.XPATH, "//div[2]/h2[@class='sc-bsVVwV gnfeLF']")]
    except:
        print("na")
    # driver.execute_script("window.scrollBy(0,500)", "")
    time.sleep(INTERVAL)

df1 = pd.DataFrame(data=list(zip(Firm_name, Remediation_status, Safety_training, Progress_rate, Worker_number)),
                   columns=['Firm_name', 'Remediation_status', 'Safety_training', 'Progress_rate', 'Worker_number'])
df1.to_csv('namefirm.csv')
When elements might not be present on the page (as with Worker_number, for example), it is better to use execute_script (i.e. JavaScript) instead of find_element or find_elements, because it returns None if the element is not on the page, while find_elements returns an empty list and find_element raises an error. Hence, if you use find_elements you have to add code to check whether the list is empty, and if you use find_element you have to add a try-except block.
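For example, here is a minimal sketch of the three options side by side, assuming factory is an already-located WebElement and .//h2 is just a placeholder XPath:

from selenium.common.exceptions import NoSuchElementException

# find_element: raises NoSuchElementException when nothing matches
try:
    text = factory.find_element(By.XPATH, ".//h2").text
except NoSuchElementException:
    text = None

# find_elements: returns an empty list, so check before indexing
matches = factory.find_elements(By.XPATH, ".//h2")
text = matches[0].text if matches else None

# execute_script: the query itself yields None when nothing matches
text = driver.execute_script(
    "return document.evaluate(arguments[0], arguments[1], null,"
    " XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue?.innerText;",
    ".//h2", factory)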
The problem with scrolling to load new elements is that, if there are hundreds of elements, the page takes up a lot of RAM and the browser might freeze. A workaround is to remove elements from the HTML after scraping them, instead of scrolling past them. We can remove an element by using driver.execute_script('var element = arguments[0]; element.remove();', element).
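As a rough sketch of the scrape-then-remove pattern (the CSS selector and the scraped field are placeholders; the full script below applies the same idea):

for el in driver.find_elements(By.CSS_SELECTOR, "div.card"):  # placeholder selector
    row = el.text  # read whatever data you need before removing the node
    driver.execute_script('var element = arguments[0]; element.remove();', el)
# the removed nodes no longer take up memory in the DOM, so the page stays
# light even after hundreds of "load more" batches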
As a final suggestion, use a dictionary instead of separate lists to store the scraped data.
data = {key: [] for key in ['Firm_name', 'Remediation_status', 'Safety_training', 'Progress_rate', 'Worker_number']}

# run an XPath query relative to a context node (arguments[1]) and return the
# matched node's text, or None when there is no match
js = 'return document.evaluate(arguments[0], arguments[1], null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue?.innerText;'

max_wait = 9  # seconds

while 1:
    # wait until a new batch of factories is loaded, or give up after max_wait
    factories = []
    start = time.time()
    while len(factories) < 2:
        factories = driver.find_elements(By.CSS_SELECTOR, "#factories>div+div+div>div>div>div+div>div+div+div>div")
        if time.time() - start > max_wait:
            print('no new factories')
            start = -1
            break
    if start < 0:
        break
    else:
        for factory in factories:
            data['Firm_name'] += [driver.execute_script(js, ".//h2", factory)]
            data['Remediation_status'] += [driver.execute_script(js, ".//p[contains(.,'Remediation Status')]/span[2]", factory)]
            data['Safety_training'] += [driver.execute_script(js, ".//p[contains(.,'Safety Training Program')]/span[2]", factory)]
            data['Worker_number'] += [driver.execute_script(js, ".//h2[contains(.,'Workers')]/following-sibling::h2", factory)]
            data['Progress_rate'] += [driver.execute_script(js, ".//h2[contains(.,'Progress Rate')]/following-sibling::h2", factory)]
            # remove the scraped element so the DOM (and RAM usage) stays small
            driver.execute_script('var element = arguments[0]; element.remove();', factory)
        print(f"{len(data['Firm_name'])} factories scraped", end='\r')
Execution
Then, by running pd.DataFrame(data), you get a dataframe with the scraped values, one row per factory.
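For completeness, a minimal sketch of turning the collected dictionary into a CSV (the file name namefirm.csv mirrors the one used in the question):

import pandas as pd

df = pd.DataFrame(data)                 # columns come from the dictionary keys
df.to_csv('namefirm.csv', index=False)  # save without the pandas index column
print(df.head())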