Tags: python, selenium-webdriver, urllib

get_attribute('src') is not getting the url anymore


I wrote a script that scrapes Google Images using Selenium WebDriver. The webdriver steps through the images and fetches the URL of each one. But when I ran the script today, it didn't get the URL for any of the images.

from selenium import webdriver
import urllib.request
from PIL import Image
import os


keyword=input('keyword : ')
n=150
url=input('url : ')

# provide path to directory before running the code
path='E://Old//cust_data'

if keyword not in os.listdir(path):
    os.mkdir(path+'//'+keyword)

img_dir=path+'//'+keyword

driver=webdriver.Chrome('E://Old//card//chromedriver.exe')
driver.get(url)



i=1
j=1
while j<=n:

    try:
        driver.find_element_by_xpath('//*[@id="islrg"]/div[1]/div[{}]/a[1]/div[1]/img'.format(i)).click()
        img=driver.find_element_by_xpath('//*[@id="Sva75c"]/div/div/div[3]/div[2]/c-wiz/div/div[1]/div[1]/div/div[2]/a/img')
        link=img.get_attribute('src')
        print(link)
        urllib.request.urlretrieve(link,img_dir+'//'+keyword+' '+str(j)+'.jpg')

        size=os.stat(img_dir+'//'+keyword+' '+str(j)+'.jpg').st_size

        if size<15000:
            os.remove(img_dir+'//'+keyword+' '+str(j)+'.jpg')
        else:
            im=Image.open(img_dir+'//'+keyword+' '+str(j)+'.jpg')
            print(keyword+' '+str(j)+'.jpg',(im.size[0],im.size[1]))
            j+=1
        i+=1


    except:
        i+=1
        print('error')
        pass

driver.close()

It returns 'error' for every image. It was working fine before, and I don't know what's causing this. Also, while navigating, the webdriver sometimes just stops. There's no error or anything, it just stops.


Solution

  • The first thing I would check is which exception is being thrown, because the exception may be raised somewhere else than you expect, for example while saving the file.

    Try adding:

    try:
        ...
    except Exception as e:
        print("Error with exception: ", e)
    

    This will give you information on what is going wrong.
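
    As a minimal sketch of that idea, each step of the loop can be run through a small wrapper so the printed message says which stage failed and with which exception type. The helper name `run_stage` and the stage labels are hypothetical, not part of Selenium or urllib:

```python
import traceback


def run_stage(label, func, *args, **kwargs):
    """Call func and return (result, error), reporting a failure instead of hiding it."""
    try:
        return func(*args, **kwargs), None
    except Exception as exc:  # broad on purpose: we are still diagnosing
        # Show the stage, the exception class and its message
        print(f"[{label}] {type(exc).__name__}: {exc}")
        traceback.print_exc()  # full traceback goes to stderr
        return None, exc


# Stand-ins for real steps: int("abc") fails, int("42") succeeds
result, error = run_stage("download", int, "abc")
ok_result, ok_error = run_stage("parse", int, "42")
```

    In the question's loop, the click, the `get_attribute('src')` call and `urlretrieve` would each be wrapped separately, so a failure while saving the file is no longer reported as if the element lookup had failed.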

    The second issue is your search method. Try to avoid long, position-based XPath expressions such as "//*[@id="islrg"]/div[1]/div[{}]/a[1]/div[1]/img" and search for more specific patterns instead; paths like this stop matching as soon as the page layout changes.

    The final point is the tool you are using. As you describe it, all you need is to get the images from a page. This can be done much faster and more efficiently with a web scraping library such as BeautifulSoup than with a browser automation tool such as Selenium.

    As an example for your problem, I wrote a small script that scrapes all the image URLs from a page in just a second using bs4:

    import requests  # to download the HTML
    from bs4 import BeautifulSoup as bs  # to parse it
    
    page = requests.get("some URL")  # download the HTML
    
    soup = bs(page.text, "html.parser")  # feed it to BeautifulSoup, naming the parser explicitly
    all_imgs = soup.find_all("img")  # extract all <img> tags
    
    img_urls = []
    
    for img in all_imgs:  # iterate over all images
        img_urls.append(img.get("src"))  # collect each image's "src" attribute value (None if missing)
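
    The collected "src" values are not always ready to download: they can be missing, relative to the page, or inline `data:` images. A small sketch of how the list could be cleaned up with the standard library follows; the function name `clean_image_urls` and the sample values are made up for illustration:

```python
from urllib.parse import urljoin


def clean_image_urls(base_url, srcs):
    """Return absolute http(s) image URLs, skipping empty and inline data: values."""
    urls = []
    for src in srcs:
        if not src or src.startswith("data:"):
            continue  # missing attribute or inline image: nothing to download
        absolute = urljoin(base_url, src)  # resolves relative and //host paths
        if absolute.startswith(("http://", "https://")):
            urls.append(absolute)
    return urls


# Hypothetical values of the kind img.get("src") can return
sample = [
    None,
    "/logo.png",
    "data:image/gif;base64,R0lGOD",
    "//cdn.example.com/a.jpg",
    "https://example.com/b.jpg",
]
cleaned = clean_image_urls("https://example.com/gallery/", sample)
print(cleaned)
```

    With the script above, this would be called as `clean_image_urls("some URL", img_urls)`, using the same URL that was passed to `requests.get`.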
    

    Caution:

    With this approach you download the plain HTML, which may differ from what you see when you open the same URL in your browser. So when writing a scraper like this, first save the response to a file:

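    # page is the response returned by requests.get(...)
    with open("test.html", "w", encoding="utf-8") as f:  # explicit encoding avoids platform-dependent write errors
        f.write(page.text)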
    

    Then inspect this file in your browser to work out how to get the information you need.
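
    A quick complementary check is to count what the plain HTML actually contains. This sketch uses only the standard library `html.parser`; the class name `ImgCounter` and the sample markup are made up, and in practice the string would be the contents of test.html:

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; src may be absent
            self.sources.append(dict(attrs).get("src"))


# Made-up markup standing in for the contents of test.html
html = '<div><img src="a.jpg"><img data-src="lazy.jpg"><p>text</p></div>'
counter = ImgCounter()
counter.feed(html)
print(len(counter.sources), counter.sources)
```

    If the count is far lower than what the browser shows, or many entries have no "src", the page is probably filling the images in with JavaScript, and a plain HTML download will not be enough on its own.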