My importhtml
and importxml
functions in Google Sheets suddenly stopped working. I'm guessing it has something to do with the URL as it has been working for about a month (since I set it up) and has only recently stopped (last week). Now half the sheets is displaying #N/A unfortunately. Just wondering if someone more advanced in this area could share their insight into the problem.
IMPORTXML example:
URL: https://au.finance.yahoo.com/quote/AZJ.AX/profile?p=AZJ.AX
xPath: //*[@id='Col1-0-Profile-Proxy']/section/div[1]/div/div/p[2]/span[4]
Function returning "Resource at URL not found".
Here is a link to an example file: https://docs.google.com/spreadsheets/d/1ml3cCHIEk4vTCn3GyymXGxyCpVBgMprMzGP7OYhKPC0/edit?usp=sharing
This looks like it may in fact be a issue on Google's side:
There has not been any resolution, but go and star that issue if it affects you or you want updates.
I have not used the Yahoo site a lot, but maybe it has changed its format in a way that makes it unreadable for IMPORTXML. I also tested it with IMPORTHTML, with the same result. The function itself is still working with other sites though.
Testing the site you gave, disabling JavaScript to ensure the data is not dynamically generated (which makes it unreadable for IMPORTHTML), there is in fact a table:
<table class="W(100%)" data-reactid="36">
<thead data-reactid="37">
<tr class="C($tertiaryColor) Fz(xs) BdB Bdc($seperatorColor)" data-reactid="38">
<th class="Ta(start) Py(6px) Fw(n)" data-reactid="39"><span data-reactid="40">Name</span></th>
<th class="Ta(start) Py(6px) Fw(n)" data-reactid="41"><span data-reactid="42">Title</span></th>
<th class="Ta(end) Py(6px) Fw(n)" data-reactid="43"><span data-reactid="44">Pay</span></th>
<th class="Ta(end) Py(6px) Fw(n)" data-reactid="45"><span data-reactid="46">Exercised</span></th>
<th class="Ta(end) Py(6px) Fw(n)" data-reactid="47"><span data-reactid="48">Year born</span></th>
</tr>
</thead>
<tbody data-reactid="49">
<tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="50">
<td class="Ta(start)" data-reactid="51">
<span class="" data-reactid="52">
<!-- react-text: 53 -->Mr. Andrew T. Harding<!-- /react-text -->
</span>
</td>
<td class="Ta(start) W(45%)" data-reactid="54">
<span class="" data-reactid="55">
<!-- react-text: 56 -->MD, CEO & Director<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="57">
<span class="" data-reactid="58">
<!-- react-text: 59 -->2.78M<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="60"><span class="" data-reactid="61"><span data-reactid="62">N/A</span></span></td>
<td class="Ta(end)" data-reactid="63">
<span class="" data-reactid="64">
<!-- react-text: 65 -->1967<!-- /react-text -->
</span>
</td>
</tr>
<tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="66">
<td class="Ta(start)" data-reactid="67">
<span class="" data-reactid="68">
<!-- react-text: 69 -->Mr. George Lippiatt<!-- /react-text -->
</span>
</td>
<td class="Ta(start) W(45%)" data-reactid="70">
<span class="" data-reactid="71">
<!-- react-text: 72 -->CFO & Group Exec. Strategy<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="73">
<span class="" data-reactid="74">
<!-- react-text: 75 -->481.25k<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="76"><span class="" data-reactid="77"><span data-reactid="78">N/A</span></span></td>
<td class="Ta(end)" data-reactid="79"><span class="" data-reactid="80"><span data-reactid="81">N/A</span></span></td>
</tr>
<tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="82">
<td class="Ta(start)" data-reactid="83">
<span class="" data-reactid="84">
<!-- react-text: 85 -->Mr. Ed McKeiver<!-- /react-text -->
</span>
</td>
<td class="Ta(start) W(45%)" data-reactid="86">
<span class="" data-reactid="87">
<!-- react-text: 88 -->Group Exec. of Coal<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="89">
<span class="" data-reactid="90">
<!-- react-text: 91 -->861k<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="92"><span class="" data-reactid="93"><span data-reactid="94">N/A</span></span></td>
<td class="Ta(end)" data-reactid="95"><span class="" data-reactid="96"><span data-reactid="97">N/A</span></span></td>
</tr>
<tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="98">
<td class="Ta(start)" data-reactid="99">
<span class="" data-reactid="100">
<!-- react-text: 101 -->Mr. Michael G. Carter BEng, BBus, MAICD, AFAIM<!-- /react-text -->
</span>
</td>
<td class="Ta(start) W(45%)" data-reactid="102">
<span class="" data-reactid="103">
<!-- react-text: 104 -->Group Exec. of Technical Services & Planning<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="105"><span class="" data-reactid="106"><span data-reactid="107">N/A</span></span></td>
<td class="Ta(end)" data-reactid="108"><span class="" data-reactid="109"><span data-reactid="110">N/A</span></span></td>
<td class="Ta(end)" data-reactid="111"><span class="" data-reactid="112"><span data-reactid="113">N/A</span></span></td>
</tr>
<tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="114">
<td class="Ta(start)" data-reactid="115">
<span class="" data-reactid="116">
<!-- react-text: 117 -->Mr. Christopher L. Vagg<!-- /react-text -->
</span>
</td>
<td class="Ta(start) W(45%)" data-reactid="118">
<span class="" data-reactid="119">
<!-- react-text: 120 -->Head of Investor Relations & Group Treasurer<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="121"><span class="" data-reactid="122"><span data-reactid="123">N/A</span></span></td>
<td class="Ta(end)" data-reactid="124"><span class="" data-reactid="125"><span data-reactid="126">N/A</span></span></td>
<td class="Ta(end)" data-reactid="127"><span class="" data-reactid="128"><span data-reactid="129">N/A</span></span></td>
</tr>
<tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="130">
<td class="Ta(start)" data-reactid="131">
<span class="" data-reactid="132">
<!-- react-text: 133 -->Mark Hairsine<!-- /react-text -->
</span>
</td>
<td class="Ta(start) W(45%)" data-reactid="134">
<span class="" data-reactid="135">
<!-- react-text: 136 -->Mang. of Media & Communications<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="137"><span class="" data-reactid="138"><span data-reactid="139">N/A</span></span></td>
<td class="Ta(end)" data-reactid="140"><span class="" data-reactid="141"><span data-reactid="142">N/A</span></span></td>
<td class="Ta(end)" data-reactid="143"><span class="" data-reactid="144"><span data-reactid="145">N/A</span></span></td>
</tr>
<tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="146">
<td class="Ta(start)" data-reactid="147">
<span class="" data-reactid="148">
<!-- react-text: 149 -->Ms. Tina Thomas<!-- /react-text -->
</span>
</td>
<td class="Ta(start) W(45%)" data-reactid="150">
<span class="" data-reactid="151">
<!-- react-text: 152 -->Group Exec. of Corp.<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="153"><span class="" data-reactid="154"><span data-reactid="155">N/A</span></span></td>
<td class="Ta(end)" data-reactid="156"><span class="" data-reactid="157"><span data-reactid="158">N/A</span></span></td>
<td class="Ta(end)" data-reactid="159"><span class="" data-reactid="160"><span data-reactid="161">N/A</span></span></td>
</tr>
<tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="162">
<td class="Ta(start)" data-reactid="163">
<span class="" data-reactid="164">
<!-- react-text: 165 -->James Coe<!-- /react-text -->
</span>
</td>
<td class="Ta(start) W(45%)" data-reactid="166">
<span class="" data-reactid="167">
<!-- react-text: 168 -->Mang. of Market Intelligence & Sustainability<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="169"><span class="" data-reactid="170"><span data-reactid="171">N/A</span></span></td>
<td class="Ta(end)" data-reactid="172"><span class="" data-reactid="173"><span data-reactid="174">N/A</span></span></td>
<td class="Ta(end)" data-reactid="175"><span class="" data-reactid="176"><span data-reactid="177">N/A</span></span></td>
</tr>
<tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="178">
<td class="Ta(start)" data-reactid="179">
<span class="" data-reactid="180">
<!-- react-text: 181 -->Mr. Dominic Dupont Smith<!-- /react-text -->
</span>
</td>
<td class="Ta(start) W(45%)" data-reactid="182">
<span class="" data-reactid="183">
<!-- react-text: 184 -->Company Sec.<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="185"><span class="" data-reactid="186"><span data-reactid="187">N/A</span></span></td>
<td class="Ta(end)" data-reactid="188"><span class="" data-reactid="189"><span data-reactid="190">N/A</span></span></td>
<td class="Ta(end)" data-reactid="191">
<span class="" data-reactid="192">
<!-- react-text: 193 -->1964<!-- /react-text -->
</span>
</td>
</tr>
<tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="194">
<td class="Ta(start)" data-reactid="195">
<span class="" data-reactid="196">
<!-- react-text: 197 -->Lowana Riddiford<!-- /react-text -->
</span>
</td>
<td class="Ta(start) W(45%)" data-reactid="198">
<span class="" data-reactid="199">
<!-- react-text: 200 -->Warehouse Logistics Coordinator<!-- /react-text -->
</span>
</td>
<td class="Ta(end)" data-reactid="201"><span class="" data-reactid="202"><span data-reactid="203">N/A</span></span></td>
<td class="Ta(end)" data-reactid="204"><span class="" data-reactid="205"><span data-reactid="206">N/A</span></span></td>
<td class="Ta(end)" data-reactid="207"><span class="" data-reactid="208"><span data-reactid="209">N/A</span></span></td>
</tr>
</tbody>
</table>
The fact that there is a <span>
element within each <td>
should not affect IMPORTXML / IMPORTHTML as in other sites it can pick up the text content within <a>
tags no problem.
However the comments for react <!-- react-text: 193 -->1964<!-- /react-text -->
might be causing the Spreadsheet functions to get confused.
Their robots.txt file doesn't seem to contain any clues either:
User-agent: *
Sitemap: https://au.finance.yahoo.com/sitemaps/finance-sitemap_index_AU_en-AU.xml.gz
Disallow: /r/
Disallow: /__rapidworker-1.2.js
Disallow: /__blank
Disallow: /_td_api
Disallow: /_remote
Google's user agent is Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; http://docs.google.com)
So I can't say for sure what about the site makes it go wrong, but it is likely a change on the Yahoo side of things. Maybe they have rate limited Google's IP addresses
I believe you are pushing the limits of what is possible with humble Spreadsheet functions. Which is great, don't get me wrong, however, scraping with a tool that is really not meant for heavy scraping is not very reliable. Considering that web scraping is also not 100% reliable as a source of information as it is constantly subject to the scraped sites changing their HTML structure or leaving things to be dynamically generated by JavaScript. Things will eventually break.
With UrlFetchApp
within Apps Script you have a lot more control and flexibility over how to scrape your data.
Though if this is a serious activity for you which requires dependability, it will pay off to use an API. APISs are designed for websites and sheets and databases to fetch data from them. They welcome it. There are some free offerings about too.