html css selenium-webdriver beautifulsoup tesseract

How can I read and save an image from a page with Selenium, BeautifulSoup and Python 3?


My goal is to save a single image from a website after a login procedure. Inspecting the image shows a full XPath of /html/body/form/main/div/section/div[1]/div/div[2]/div/img. I want to use Beautiful Soup or an image crawler to save the image to a variable and then extract the text from it with Tesseract. Lately I have been struggling with urllib, urllib.request, and Selenium's lookup of images by XPath. My idea was to use Selenium to save the image, but I didn't find a working approach. Now I need help with the coding part: can I save the image to a variable, and can Tesseract read the image from that variable? A sample of the image and a screenshot of its inspected markup are given below (the inspected image element is highlighted). The form is just a sample and doesn't exist in real life (at least as far as I know). Any help would be appreciated. Thanks a lot.

Image 1: [image: sample form]

Image 2: [image: inspector view with the image element highlighted]


Solution

  • You can use urllib to save the image

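    import urllib.request
    
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    driver = webdriver.Chrome()
    driver.get(WEBSITE_URL)  # WEBSITE_URL: placeholder for the page address
    
    # locate the image element and read its source URL
    img = driver.find_element(By.XPATH, '/html/body/form/main/div/section/div[1]/div/div[2]/div/img')
    src = img.get_attribute('src')
    
    # download the image to img.png in the working directory
    urllib.request.urlretrieve(src, "img.png")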
    

    This saves the image as img.png in your working directory; you can then use image processing and Tesseract to extract the text from it. I don't recommend locating the image with an absolute XPath, because it breaks as soon as the site owner changes the page structure. Instead, look it up by its id:

    from selenium.webdriver.common.by import By
    img = driver.find_element(By.ID, "ContentPlaceHolder1_Imgquestions")

    That way you can still find the image even if the page layout changes, as long as its id stays the same.
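
If you want the image in a variable instead of a file, one possible approach is to let Selenium capture the element itself and hand the bytes to Tesseract. The sketch below assumes the pytesseract and Pillow packages (neither is mentioned above) and a Tesseract binary on your PATH; read_image_text, ocr_png_bytes and the ocr parameter are hypothetical names used only for illustration. Because screenshot_as_png reads the pixels from the already logged-in browser, it doesn't open a second, cookie-less request the way urlretrieve does, but the result is the rendered image rather than the original file.

```python
import io


def ocr_png_bytes(png_bytes, ocr=None):
    """Run OCR on in-memory PNG bytes and return the recognised text.

    `ocr` is a hypothetical hook: any callable taking bytes and returning
    a string. When omitted, pytesseract + Pillow are used (assumed installed).
    """
    if ocr is None:
        # Imported lazily so this module loads even without the OCR stack.
        from PIL import Image
        import pytesseract

        def ocr(data):
            # Pillow can open an image from any file-like object,
            # so the bytes never have to touch the disk.
            return pytesseract.image_to_string(Image.open(io.BytesIO(data)))

    return ocr(png_bytes).strip()


def read_image_text(driver, image_id, ocr=None):
    """Locate an <img> by id, capture it as PNG bytes, and OCR it."""
    # "id" is the string value behind selenium's By.ID locator.
    img = driver.find_element("id", image_id)
    # The image now lives in a variable as raw PNG bytes.
    png_bytes = img.screenshot_as_png
    return ocr_png_bytes(png_bytes, ocr=ocr)
```

With a live driver you would call read_image_text(driver, "ContentPlaceHolder1_Imgquestions"); the optional ocr argument is only there so the OCR step can be swapped out or exercised without a browser.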