image google-sheets web-scraping xpath google-sheets-formula

Image scraping in Google Sheets using ImportXml

Using Google Sheets, I am trying to scrape images from a real estate website to display in a cell along side the property details. I have been able demonstrate that this is possible using a simple example, but when I try to formulate the xpath query to scrape the specific images I need I continually get an error.

As a working example I will use this webpage.

The cell references in the ImportXML command in my examples always points to this URL.

The images I want are part of a slider and easily identifiable by their class: 'rsImg rsMainSlideImage'

I attempted to scrape the first image in the slider using the following command:

=IMPORTXML(A2, "(//img[@class='rsImg rsMainSlideImage'])[1]/@src")

I continually get the error:

"Imported content is empty"

To diagnose the problem I did a simpler example:

I am able to get the first image from the page (not my target image) using the following xPath:

=IMPORTXML(A2, "(//img)[1]/@src")

This successfully displays the URL of the image.

I am able to display this image in a cell by wrapping it in the image command:

=image(IMPORTXML(A2, "(//img)[1]/@src"))

This shows that in principle I should be able to scrape an image and display it in a cell.

But I am not able to select the images I am targeting using the class without an error. This is the command that I used:

=IMPORTXML(A2, "(//img[@class='rsImg rsMainSlideImage'])[1]/@src")

Other than selecting the image based on the class attribute, I am not sure what the difference is between my example that works and the example that doesn't.

I would be very grateful for any support to get this working.

Solution

The problem lies with the website

Your xpaths seem fine, but the website HTML does not!

If you use the HTML validator with this web site:

https://validator.w3.org/nu/?doc=https%3A%2F%2Fwww.jelliscraig.com.au%2Fproperty-details-228A-Victoria-Street-Ballarat-East%2F1042039

You'll see that it has a bunch of errors, and most critically for XML, it has stray tags. So for the most part, the XML parser either gets confused or rejects it as invalid.

I experimented with a bunch of different xpaths but could not find any img tag anywhere, even when making an x-path for everything / or /* - which tells me that this HTML is likely poorly formed and the XML parser can't read it.

A Workaround

=REGEXEXTRACT(
    IMPORTXML(
        "https://www.jelliscraig.com.au/property-details-228A-Victoria-Street-Ballarat-East/1042039",
        "/"
    ),
    "https:\/\/images\.listonce.+\.jpg"
)

What I did find looking through the result from the / x-path was that the link seemed to be referred to in a couple of other places. Maybe this is consistent for most of the site and most likely, all the images are served by the same URL format:

https://images.listonce.com.au ... jpg

So using this information you can wrap the IMPORTXML in a REGEXEXTRACT and with a loose regex like:

https:\/\/images\.listonce.+\.jpg

Will return, for the example URL you gave:

https://images.listonce.com.au/custom/m/listings/228a-victoria-street-ballarat-east-vic-3350/039/01042039_img_01.jpg

Which seems like it the Image you are looking for.

Apps Script

Maybe look into Apps Script and specifically UrlFetchApp. With these tools you will have a lot more control over the HTML that you fetch and will give you far more options to scrape data.

This is an example of the same process but with Apps Script

function getImageUrl() {
  // Fetch the website
  let response = UrlFetchApp.fetch("https://www.jelliscraig.com.au/property-details-228A-Victoria-Street-Ballarat-East/1042039")
  // Get the text from the response
  let html = response.getContentText()
  // Use Regex to Match the Tag
  let result = html.match(/(?<=img src=.+)https:\/\/images.listonce.com.au\/.+\.jpg/)
  // Log the first result
  Logger.log(result[0])
}

Which will log https://images.listonce.com.au/custom/l/listings/228a-victoria-street-ballarat-east-vic-3350/039/01042039_img_01.jpg

Image scraping in Google Sheets using ImportXml

The problem lies with the website

A Workaround

Apps Script

References