Tags: python, selenium, web-scraping, lazy-loading

Getting lazy loaded images while scraping


I am trying to scrape the images from this website, but instead of each image's real src I only get the src of the lazy-loading placeholder.

import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

url = "https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/squads"
s = Service(r"M:\WebScraping\chromedriver.exe")  # raw string so the backslashes are not treated as escapes

driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.get(url)
time.sleep(5)
driver.execute_script("window.scrollTo(0, 500);")

page = urllib.request.urlopen(url)
doc = BeautifulSoup(page, "html.parser")

teams = doc.find(class_="ds-p-0").find(class_="ds-mb-4")

for team in teams:
    print(team.img["src"])
    file_name = team.img["alt"]
    img_file = open(file_name + ".png", "wb")
    img_file.write(urllib.request.urlopen(team.img["src"]).read())
    img_file.close()

This is the output I get. Every URL is just the lazy-loading placeholder:

https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg

What I want instead is the actual src of each image, for example:

https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/333800/333885.png

Solution

  • BeautifulSoup only parses the HTML it is given, and neither it nor urllib executes JavaScript. That is why, when you run

    page = urllib.request.urlopen(url)
    doc = BeautifulSoup(page, "html.parser")
    

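    you get the lazy-loading placeholder links: the HTML fetched by urlopen has not had its images swapped in by the page's scripts. Selenium, on the other hand, drives a real browser that runs those scripts, so you can load the page with Selenium and then pass its page source to BeautifulSoup instead of the urlopen response: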

    doc = BeautifulSoup(driver.page_source, "html.parser")
    

    BeautifulSoup then parses the rendered HTML of the page. The following code prints the URLs with both Selenium and BeautifulSoup, so that you can compare the two techniques.

    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    chromedriver_path = '...'
    # default Chrome options; add arguments here if needed
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(service=Service(chromedriver_path), options=options)
    
    url = "https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/squads"
    driver.get(url)
    
    # wait (up to 20 seconds) until the images are visible on page
    images = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".ds-p-0 .ds-mb-4 img")))
    # scroll to the last image, so that all images get rendered correctly
    driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', images[-1])
    time.sleep(2)
    
    # PRINT URLS USING SELENIUM
    
    print('Selenium')
    for img in images:
        print(img.get_attribute('src'))
    
    # PRINT URLS USING BEAUTIFULSOUP
    
    doc = BeautifulSoup(driver.page_source, "html.parser")
    teams = doc.find(class_="ds-p-0").find(class_="ds-mb-4")
    
    print('BeautifulSoup')
    for team in teams:
        print(team.img["src"])
    

    Output

    Selenium 
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313421.logo.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313422.logo.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/334700/334707.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313419.logo.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/333800/333885.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/344000/344062.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/317000/317003.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313423.logo.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313418.logo.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313480.logo.png
    
    BeautifulSoup
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313421.logo.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313422.logo.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/334700/334707.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313419.logo.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/333800/333885.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/344000/344062.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/317000/317003.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313423.logo.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313418.logo.png
    https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313480.logo.png
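
    The question also saved each image under its alt text. Once the real URLs are available, that download step could be sketched roughly as below. This is only a minimal sketch under some assumptions: the helper names (is_placeholder, safe_file_name, download_images) are hypothetical and not part of any library, and the placeholder is recognised purely by the "lazyimage" substring seen in the placeholder URL from the question.

```python
import re
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# Assumption: lazy placeholders can be recognised by this substring,
# taken from the placeholder URL shown in the question.
PLACEHOLDER_MARKER = "lazyimage"


def is_placeholder(src):
    """Return True if src is missing or still points at the lazy-load placeholder."""
    return not src or PLACEHOLDER_MARKER in src


def safe_file_name(alt, src):
    """Build a file name from the alt text, keeping the extension found in the URL."""
    # Replace characters that are awkward in file names with underscores.
    stem = re.sub(r"[^\w\- ]", "_", alt).strip() or "image"
    # Fall back to .png when the URL path has no extension.
    suffix = Path(urlparse(src).path).suffix or ".png"
    return stem + suffix


def download_images(images, target_dir="."):
    """Save every (alt, src) pair whose src is a real image URL; return the saved paths."""
    saved = []
    for alt, src in images:
        if is_placeholder(src):
            continue  # the image was never rendered, so there is nothing useful to fetch
        path = Path(target_dir) / safe_file_name(alt, src)
        with urllib.request.urlopen(src) as response:
            path.write_bytes(response.read())
        saved.append(path)
    return saved
```

    With the Selenium elements from the code above, the (alt, src) pairs can be built from img.get_attribute("alt") and img.get_attribute("src") for each img in images, and the resulting list passed to download_images. Skipping placeholders rather than raising is a design choice: it keeps one unrendered image from aborting the whole run, at the cost of silently missing that file.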