
Why is selenium webdriver in python not returning all image links?


I am using Selenium WebDriver in Python to collect the URLs of images from a website whose content is loaded with JavaScript. The following code returns only about 160 of the roughly 240 links. Why might this be? Is it because of the JavaScript rendering?

Is there a way to adjust my code to get around this?

driver = webdriver.Chrome(ChromeDriverManager().install(), options = chrome_options)
driver.get('https://www.politicsanddesign.com/')
img_url = driver.find_elements_by_xpath("//div[@class='responsive-image-wrapper']/img")

img_url2 = []
for element in img_url:
    new_srcset = 'https:' + element.get_attribute("srcset").split(' 400w', 1)[0]
    img_url2.append(new_srcset)

Solution

  • You need to wait for all of those elements to be loaded.
    The recommended approach is to use WebDriverWait with expected_conditions explicit waits.
    This code gives me 760-880 elements in the img_url2 list:

    import time
    
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    options = Options()
    options.add_argument("start-maximized")
    
    webdriver_service = Service(r'C:\webdrivers\chromedriver.exe')  # raw string so '\w' is not treated as an escape sequence
    driver = webdriver.Chrome(options=options, service=webdriver_service)
    wait = WebDriverWait(driver, 10)
    
    url = "https://www.politicsanddesign.com/"
    
    driver.get(url)  # once the browser opens, turn off the year filter and scroll to the bottom; the page does not load all elements on first render
    wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='responsive-image-wrapper']/img")))
    # time.sleep(2)
    img_url = driver.find_elements(By.XPATH, "//div[@class='responsive-image-wrapper']/img")
    
    img_url2 = []
    for element in img_url:
        new_srcset = 'https:' + element.get_attribute("srcset").split(' 400w', 1)[0]
        img_url2.append(new_srcset)
    

    I'm not sure whether this code is stable enough, so if needed you can uncomment the time.sleep(2) delay between the wait line and the line that grabs img_url.
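    As a side note, splitting on the literal string ' 400w' silently breaks if the first srcset candidate ever uses a different width descriptor. A more general approach is to parse the srcset format itself (comma-separated candidates, each a URL followed by an optional descriptor). This helper is my own sketch, not part of the original answer:

```python
def first_srcset_url(srcset, scheme="https:"):
    """Return the URL of the first candidate in a srcset string.

    A srcset looks like '//cdn.example.com/a.jpg 400w, //cdn.example.com/b.jpg 800w'.
    Protocol-relative URLs (starting with '//') get the given scheme prepended.
    """
    first_candidate = srcset.split(",", 1)[0].strip()
    url = first_candidate.split()[0]  # drop the width/density descriptor, if any
    if url.startswith("//"):
        url = scheme + url
    return url
```

    With this, the loop body becomes `img_url2.append(first_srcset_url(element.get_attribute("srcset")))`, and it no longer depends on the first candidate being exactly 400w wide.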

    EDIT:

    Once the browser opens, you'll need to turn off the page's year filter and then scroll all the way to the bottom, because the page does not load all of its elements when it first renders; it only does so once you've interacted with it a little.
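    The manual scrolling step can also be automated. This is my own sketch (the scroll_to_bottom helper is not part of the original answer), assuming the page grows document.body.scrollHeight as lazy-loaded content appears:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=30):
    """Scroll down until the document height stops growing, or max_rounds is hit."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded images time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded since the last scroll
        last_height = new_height
```

    Calling `scroll_to_bottom(driver)` right after the wait line, and only then collecting the elements, should remove the need to scroll by hand. The filter would still need to be handled separately, e.g. by clicking its toggle with a located element.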