Tags: python, loops, selenium-webdriver, scroll

How to loop within div to maintain correct order of values, and scroll page to get entire data


I want to get the data within a div loop, so that the values end up in the correct rows. I also want the data from the entire page, not only from the visible part of the page.

  1. How can I get Firm_name, Remediation_status, ... from div[@class='sc-kbGplQ bCRLdc']?
  2. The code below gives fewer than 20 rows, while there are 1,800+ firms in total. How can I scroll the page and get the data from the entire page? Thanks in advance.
ruby = driver.find_elements(By.XPATH, "//div[@class='sc-kbGplQ bCRLdc']")
for i in ruby:    
#    actions.move_to_element(i).perform()
    driver.execute_script("arguments[0].scrollIntoView();", i)
    time.sleep(INTERVAL)

    try:
        Firm_name = [Firm_name.text for Firm_name in i.find_elements(By.XPATH, "//div[1]/h2[@class='sc-idjmjb jDJltL']")]        
        Remediation_status = [Remediation_status.text for Remediation_status in i.find_elements(By.XPATH, "//div[1]/span[2][@class='sc-iKpIOp iKvkEG']")]
        Safety_training = [Safety_training.text for Safety_training in i.find_elements(By.XPATH, "//div[2]/span[2][@class = 'sc-iKpIOp iKvkEG']" )]
        Worker_number = [Worker_number.text for Worker_number in i.find_elements(By.XPATH, "//div[1]/h2[@class='sc-bsVVwV gnfeLF']")]
        Progress_rate = [Progress_rate.text for Progress_rate in i.find_elements(By.XPATH, "//div[2]/h2[@class= 'sc-bsVVwV gnfeLF']")]        
    except:
        print("na")
#driver.execute_script("window.scrollBy(0,500)","")
time.sleep(INTERVAL)
df1 = pd.DataFrame(data=list(zip(Firm_name, Remediation_status, Safety_training, Progress_rate, Worker_number)), columns=['Firm_name', 'Remediation_status', 'Safety_training', 'Progress_rate', 'Worker_number'])
df1.to_csv('namefirm.csv')


Solution

  • When elements might not be present on the page, as is the case for Worker_number, it is better to use execute_script (i.e. JavaScript) than find_element or find_elements, because the script can return None when the element is not in the page. By contrast, find_elements returns an empty list and find_element raises an exception, so with find_elements you have to add code to check whether the list is empty, and with find_element you have to wrap the call in a try-except block.
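A minimal illustration of the three "not found" conventions, using a plain dict as a stand-in for the DOM (the lookup functions below are hypothetical analogues of the Selenium calls, not their real implementations):

```python
# Stand-in DOM: "worker_number" is deliberately absent.
dom = {"firm_name": "Acme Textiles"}

# find_element-style: raises when missing -> caller needs try/except.
def find_element(key):
    return dom[key]  # KeyError here plays the role of NoSuchElementException

try:
    workers = find_element("worker_number")
except KeyError:
    workers = None

# find_elements-style: empty list when missing -> caller must check emptiness.
def find_elements(key):
    return [dom[key]] if key in dom else []

matches = find_elements("worker_number")
workers2 = matches[0] if matches else None

# execute_script-style: the script itself yields the value or None,
# like '...singleNodeValue?.innerText' -> no extra handling needed.
def execute_script(key):
    return dom.get(key)

workers3 = execute_script("worker_number")

assert workers is None and workers2 is None and workers3 is None
assert execute_script("firm_name") == "Acme Textiles"
```

All three end up with None for the missing field, but only the execute_script-style lookup gets there without extra handling code at the call site.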

    The problem with scrolling to load new elements is that, once there are hundreds of elements in the DOM, the browser consumes a lot of RAM and may freeze. A workaround is to remove each element from the HTML after scraping it, instead of scrolling past it. An element can be removed with driver.execute_script('var element = arguments[0]; element.remove();', element).

    As a final suggestion, use a dictionary of lists instead of separate lists to store the scraped data.

    data = {key:[] for key in ['Firm_name', 'Remediation_status', 'Safety_training', 'Progress_rate', 'Worker_number']}
    js = 'return document.evaluate(arguments[0], arguments[1], null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue?.innerText;'
    max_wait = 9 # seconds
    
    while True:
    
        factories = []
        start = time.time()
        while len(factories) < 2:
            factories = driver.find_elements(By.CSS_SELECTOR, "#factories>div+div+div>div>div>div+div>div+div+div>div")
            if time.time() - start > max_wait:
                print('no new factories')
                start = -1
                break
        
        if start < 0:
            break
        else:
            for factory in factories:
    
                data['Firm_name']          += [driver.execute_script(js, ".//h2", factory)]
                data['Remediation_status'] += [driver.execute_script(js, ".//p[contains(.,'Remediation Status')]/span[2]", factory)]
                data['Safety_training']    += [driver.execute_script(js, ".//p[contains(.,'Safety Training Program')]/span[2]", factory)]
                data['Worker_number']      += [driver.execute_script(js, ".//h2[contains(.,'Workers')]/following-sibling::h2", factory)]
                data['Progress_rate']      += [driver.execute_script(js, ".//h2[contains(.,'Progress Rate')]/following-sibling::h2", factory)]
    
                driver.execute_script('var element = arguments[0]; element.remove();', factory)
                print(f"{len(data['Firm_name'])} factories scraped", end='\r')
    

    Execution

    (Screenshot: the scraper running, printing the live progress counter.)

    Then by running pd.DataFrame(data) you get the scraped data as a DataFrame.
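Because execute_script returns None for any field it cannot find, the dictionary of equal-length lists converts cleanly into a DataFrame. A quick sketch with made-up sample rows (the firm values here are invented for illustration):

```python
import pandas as pd

# Hypothetical snapshot of `data` after the scraping loop finishes;
# None marks a field the XPath lookup could not find for that factory.
data = {
    'Firm_name':          ['Acme Textiles', 'Beta Garments'],
    'Remediation_status': ['On Track', 'Behind Schedule'],
    'Safety_training':    ['Yes', None],
    'Progress_rate':      ['92%', '57%'],
    'Worker_number':      ['1,200', None],
}

# One column per key; rows are aligned by list index, so appending to
# every list on every iteration keeps the values in the correct rows.
df = pd.DataFrame(data)
df.to_csv('namefirm.csv', index=False)

print(df.shape)                               # (2, 5)
print(df['Safety_training'].isna().tolist())  # [False, True]
```

The None entries show up as missing values in the DataFrame, which is exactly the behavior you want for factories where a field is absent from the page.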