Search code examples
google-sheetsgoogle-sheets-querygoogle-sheets-formula

IMPORTHTML error pulling data outside of table format


I'm attempting to learn how to web scrape websites through the use of Google Sheets' ImportXML function. I'm having difficulty pulling all the customer's (188) name, industry and their associated links.

Ex: Banco Santander Spain, Financial Services,https://www.cloudera.com/about/customers/banco-santander-spain.html

The webpage I'm trying this out on is this website!

I can only pull a few customers and I'm not sure why and what I'm doing wrong.

Inspect Source (on the tile of Banco Santander Spain):

<div class="col-md-3 col-sm-6">
<div class="damSearchGrid-item damSearchGrid-slide">
<a href="#" class="dam-dropdown">
<span class="damSearchGrid-img">
<imgsrc="/content/dam/www/dynamic/images/logos/customers/santander-spain-dynamic.png/_jcr_content/renditions/xcq5dam.thumbnail.140.100.png.pagespeed.ic.y7RN8qmYUY.webp" alt="Banco Santander Spain logo" data-pagespeed-url-hash="3896535651" onload="pagespeed.CriticalImages.checkImageForCriticality(this);"></span>
<span class="damSearchGrid-cat"> Financial Services</span>
<strong class="damSearchGrid-name"> Banco Santander Spain</strong>
</a>
</div>
<div class="damSearchGrid-itemDetail clearfix">
<a href="#" class="close">x</a>
<div class="col-sm-9">
<h3>Banco Santander Spain</h3>
<p>Banco Santander is using a big data architecture for a consistent view of data, on-demand data clusters, and faster time to market.</p>
</div>
<div class="col-sm-3">
<a href="https://www.cloudera.com/about/customers/banco-santander-spain.html"> SUCCESS STORY </a><br>
</div>
</div>
</div>
</b>

So I've tried pulling a few things:

First I tried this formula:

=IMPORTXML("https://www.cloudera.com/about/customers.html","//li")

I was only able to pull about 120 names of the customers along with other irrelevant data.

Then I tried:

=IMPORTXML("https://www.cloudera.com/about/customers.html","//h3")

which only returned a short list of about 15 names.

I've tried to do other formulas like the one below but they have ended up in #N/A errors.

Ex:

=IMPORTXML("https://www.cloudera.com/about/customers.html","//span[@class='col-md col-sm-6']")

I'm a beginner and I don't know what I should be looking for in terms of getting the query correct and how to identify the root of the problem. I've searched for IMPORTXML tutorials and examples but the majority of those run through examples where the source is in a table format. My scenario appears to be different, at least to me.

Any guidance, tips, source material will be greatly appreciated.


Solution

  • unfortunately, that won't be possible because the site is controlled by JavaScript and Google Sheets can't understand/import JS. you can test this simply by disabling JS for a given link and you will see a blank page:

    0

    this is all the stuff you can scrape:

    =IMPORTXML("https://www.cloudera.com/about/customers.html", "//*")