Search code examples
imagegoogle-sheetsweb-scrapingxpathgoogle-sheets-formula

Image scraping in Google Sheets using ImportXml


Using Google Sheets, I am trying to scrape images from a real estate website to display in a cell along side the property details. I have been able demonstrate that this is possible using a simple example, but when I try to formulate the xpath query to scrape the specific images I need I continually get an error.

As a working example I will use this webpage.

The cell references in the ImportXML command in my examples always points to this URL.

The images I want are part of a slider and easily identifiable by their class: 'rsImg rsMainSlideImage'

I attempted to scrape the first image in the slider using the following command:

=IMPORTXML(A2, "(//img[@class='rsImg rsMainSlideImage'])[1]/@src")

I continually get the error:

"Imported content is empty"

To diagnose the problem I did a simpler example:

I am able to get the first image from the page (not my target image) using the following xPath:

=IMPORTXML(A2, "(//img)[1]/@src")

This successfully displays the URL of the image.

I am able to display this image in a cell by wrapping it in the image command:

=image(IMPORTXML(A2, "(//img)[1]/@src"))

This shows that in principle I should be able to scrape an image and display it in a cell.

But I am not able to select the images I am targeting using the class without an error. This is the command that I used:

=IMPORTXML(A2, "(//img[@class='rsImg rsMainSlideImage'])[1]/@src")

Other than selecting the image based on the class attribute, I am not sure what the difference is between my example that works and the example that doesn't.

I would be very grateful for any support to get this working.


Solution

  • The problem lies with the website

    Your xpaths seem fine, but the website HTML does not!

    If you use the HTML validator with this web site:

    https://validator.w3.org/nu/?doc=https%3A%2F%2Fwww.jelliscraig.com.au%2Fproperty-details-228A-Victoria-Street-Ballarat-East%2F1042039

    You'll see that it has a bunch of errors, and most critically for XML, it has stray tags. So for the most part, the XML parser either gets confused or rejects it as invalid.

    I experimented with a bunch of different xpaths but could not find any img tag anywhere, even when making an x-path for everything / or /* - which tells me that this HTML is likely poorly formed and the XML parser can't read it.

    A Workaround

    =REGEXEXTRACT(
        IMPORTXML(
            "https://www.jelliscraig.com.au/property-details-228A-Victoria-Street-Ballarat-East/1042039",
            "/"
        ),
        "https:\/\/images\.listonce.+\.jpg"
    )
    

    What I did find looking through the result from the / x-path was that the link seemed to be referred to in a couple of other places. Maybe this is consistent for most of the site and most likely, all the images are served by the same URL format:

    https://images.listonce.com.au ... jpg
    

    So using this information you can wrap the IMPORTXML in a REGEXEXTRACT and with a loose regex like:

    https:\/\/images\.listonce.+\.jpg
    

    Will return, for the example URL you gave:

    https://images.listonce.com.au/custom/m/listings/228a-victoria-street-ballarat-east-vic-3350/039/01042039_img_01.jpg

    Which seems like it the Image you are looking for.

    Apps Script

    Maybe look into Apps Script and specifically UrlFetchApp. With these tools you will have a lot more control over the HTML that you fetch and will give you far more options to scrape data.

    This is an example of the same process but with Apps Script

    function getImageUrl() {
      // Fetch the website
      let response = UrlFetchApp.fetch("https://www.jelliscraig.com.au/property-details-228A-Victoria-Street-Ballarat-East/1042039")
      // Get the text from the response
      let html = response.getContentText()
      // Use Regex to Match the Tag
      let result = html.match(/(?<=img src=.+)https:\/\/images.listonce.com.au\/.+\.jpg/)
      // Log the first result
      Logger.log(result[0])
    }
    

    Which will log https://images.listonce.com.au/custom/l/listings/228a-victoria-street-ballarat-east-vic-3350/039/01042039_img_01.jpg

    References