Tags: python, selenium-webdriver, flask, web-scraping, lazy-loading

Can't get image URLs from lazy-loaded page source


I am trying to scrape the news images and titles from a newsfeed page so I can reuse them in a signage display (Xibo). I only want the first three rows of the main content of this page, without any header/footer content or extra code/scripting: just the medium-sized picture and the title under it. The plan is to scrape the images and titles, then render a simple HTML page with Flask once a day for the CMS to read. https://news.clemson.edu/tag/extension/

As far as I can tell, I need Selenium to obtain the rendered page in this case. The code below loads the page and scrolls, but it finds no images. I also tried targeting some of the nested divs, without success. Can someone point me in the right direction for obtaining the image URLs (and ultimately the titles)?

#News feed test for Xibo Signage
#from flask import Flask, render_template
from markupsafe import Markup
#app=Flask(__name__) 
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup
import requests

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#installed chrome driver in scripts so don't need next lines?    
#chromedriver_path = '...'
driver = webdriver.Chrome()

url = "https://news.clemson.edu/tag/extension/"

driver.get(url)

# wait (up to 20 seconds) until the images are visible on page
images = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "site-main")))
# scroll to the last image, so that all images get rendered correctly
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', images[-1])
time.sleep(2)

# PRINT URLS USING SELENIUM -for test (will pass to Flask)

print('Selenium')
for img in images:
    print(img.get_attribute('src'))



#@app.route('/') 
#def home():
#   return render_template('home.html',thumbnailmk=thumbnailmk)

#if __name__ == '__main__':
#   app.run(host='0.0.0.0')
#   app.run(debug=True)

Solution

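  • The issue is that you never select an image: the wait locates elements with the class site-main, not any img element, so get_attribute('src') has nothing useful to return. Change the strategy and locate what you actually want, the images inside each article: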

    for e in driver.find_elements(By.CSS_SELECTOR,'article img'):
        print(e.get_attribute('data-srcset').split()[0])
    
    Example

    This example reads the data-srcset attribute of each image and picks the first URL listed in it:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    driver = webdriver.Chrome()
    
    url = "https://news.clemson.edu/tag/extension/"
    
    driver.get(url)
    
    for e in driver.find_elements(By.CSS_SELECTOR,'article img'):
        print(e.get_attribute('data-srcset').split()[0])
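
Since the question asks for the medium-sized picture, taking the first entry of data-srcset may not be the size you want. The following is a minimal sketch of choosing a candidate by width, assuming the attribute uses the usual "URL 768w, URL 1024w" form. The helper name pick_from_srcset, the target width, and the example.com URLs are hypothetical and not taken from the page.

```python
def pick_from_srcset(srcset, target_width):
    """Return the URL of the smallest candidate at least target_width wide.

    Falls back to the widest candidate when none is wide enough, and
    returns None when srcset is empty or missing.
    """
    candidates = []
    # Naive split on commas: assumes the URLs themselves contain no commas.
    for entry in (srcset or "").split(","):
        parts = entry.split()
        if not parts:
            continue
        url = parts[0]
        width = 0
        # A width descriptor looks like "768w"; anything else is treated as width 0.
        if len(parts) > 1 and parts[1].endswith("w") and parts[1][:-1].isdigit():
            width = int(parts[1][:-1])
        candidates.append((width, url))
    if not candidates:
        return None
    candidates.sort()
    for width, url in candidates:
        if width >= target_width:
            return url
    return candidates[-1][1]


# Hypothetical srcset value, for illustration only.
sample_srcset = (
    "https://example.com/a-300x200.jpg 300w, "
    "https://example.com/a-768x512.jpg 768w, "
    "https://example.com/a-1024x683.jpg 1024w"
)
print(pick_from_srcset(sample_srcset, 600))
```

With Selenium, this could be called as pick_from_srcset(e.get_attribute('data-srcset'), 600). Because the helper accepts None, images without the attribute are skipped rather than raising an error.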
    

    However, there is no need for Selenium here. The image URLs are already present in the HTML returned by the server, so requests with BeautifulSoup is enough:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://news.clemson.edu/tag/extension/"
    
    # name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(requests.get(url, headers={'user-agent':'some-agent'}).text, 'html.parser')
    
    for e in soup.select('article img.lazyload'):
        print(e.get('data-src'))
    

    Output:

    https://news.clemson.edu/wp-content/uploads/2023/04/ag-and-art-scaled.jpg
    https://news.clemson.edu/wp-content/uploads/2024/03/AgTech_Forum_FeatureImage.jpg
    ...
    https://news.clemson.edu/wp-content/uploads/2023/09/Cooperative-Extension-RGB-color_featured.jpg
    https://news.clemson.edu/wp-content/uploads/2023/09/20141107-simpson-5911-X5.jpg
    https://news.clemson.edu/wp-content/uploads/2023/09/TailgateFoodSafety.jpg
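
The question also asks for the titles and only the first few rows. The following is a rough sketch of pairing each image with its title, assuming every post is an article element whose title sits in a heading tag (h1 to h3). That heading selector, the function name extract_items, and the limit of 9 items are assumptions that should be checked against the real markup. The sample HTML below is made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_items(html, limit=9):
    """Return up to `limit` dicts with 'title' and 'image' keys, one per article."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for article in soup.select("article"):
        img = article.select_one("img")
        # Assumption: the post title is inside the first h1/h2/h3 of the article.
        heading = article.select_one("h1, h2, h3")
        if img is None or heading is None:
            continue
        # Lazy-loaded images keep the real URL in data-src; fall back to src.
        image_url = img.get("data-src") or img.get("src")
        if not image_url:
            continue
        items.append({"title": heading.get_text(strip=True), "image": image_url})
        if len(items) == limit:
            break
    return items


# Made-up markup that imitates the assumed structure.
sample_html = """
<article><img class="lazyload" data-src="https://example.com/a.jpg">
  <h2><a href="#">First story</a></h2></article>
<article><h2>Story without an image</h2></article>
<article><img src="https://example.com/b.jpg"><h3>Second story</h3></article>
"""
for item in extract_items(sample_html):
    print(item["title"], item["image"])
```

For the real page, the text of the requests response from the example above would be passed in place of sample_html. The resulting list could then be handed to render_template in the Flask route that is commented out in the question.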