Search code examples
google-sheetsweb-scrapinggoogle-sheets-formulayahoo-finance

IMPORTHTML/IMPORTXML suddenly returning 'resource at URL not found'


My importhtml and importxml functions in Google Sheets suddenly stopped working. I'm guessing it has something to do with the URL as it has been working for about a month (since I set it up) and has only recently stopped (last week). Now half the sheets is displaying #N/A unfortunately. Just wondering if someone more advanced in this area could share their insight into the problem.

IMPORTXML example:

URL: https://au.finance.yahoo.com/quote/AZJ.AX/profile?p=AZJ.AX

xPath: //*[@id='Col1-0-Profile-Proxy']/section/div[1]/div/div/p[2]/span[4]

Function returning "Resource at URL not found".

Here is a link to an example file: https://docs.google.com/spreadsheets/d/1ml3cCHIEk4vTCn3GyymXGxyCpVBgMprMzGP7OYhKPC0/edit?usp=sharing


Solution

  • EDIT:

    This looks like it may in fact be a issue on Google's side:

    https://issuetracker.google.com/175144626

    There has not been any resolution, but go and star that issue if it affects you or you want updates.


    Maybe Yahoo has changed its site structure

    I have not used the Yahoo site a lot, but maybe it has changed its format in a way that makes it unreadable for IMPORTXML. I also tested it with IMPORTHTML, with the same result. The function itself is still working with other sites though.

    Testing the site you gave, disabling JavaScript to ensure the data is not dynamically generated (which makes it unreadable for IMPORTHTML), there is in fact a table:

    <table class="W(100%)" data-reactid="36">
       <thead data-reactid="37">
          <tr class="C($tertiaryColor) Fz(xs) BdB Bdc($seperatorColor)" data-reactid="38">
             <th class="Ta(start) Py(6px) Fw(n)" data-reactid="39"><span data-reactid="40">Name</span></th>
             <th class="Ta(start) Py(6px) Fw(n)" data-reactid="41"><span data-reactid="42">Title</span></th>
             <th class="Ta(end) Py(6px) Fw(n)" data-reactid="43"><span data-reactid="44">Pay</span></th>
             <th class="Ta(end) Py(6px) Fw(n)" data-reactid="45"><span data-reactid="46">Exercised</span></th>
             <th class="Ta(end) Py(6px) Fw(n)" data-reactid="47"><span data-reactid="48">Year born</span></th>
          </tr>
       </thead>
       <tbody data-reactid="49">
          <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="50">
             <td class="Ta(start)" data-reactid="51">
                <span class="" data-reactid="52">
                   <!-- react-text: 53 -->Mr. Andrew T. Harding<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(start) W(45%)" data-reactid="54">
                <span class="" data-reactid="55">
                   <!-- react-text: 56 -->MD, CEO &amp; Director<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="57">
                <span class="" data-reactid="58">
                   <!-- react-text: 59 -->2.78M<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="60"><span class="" data-reactid="61"><span data-reactid="62">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="63">
                <span class="" data-reactid="64">
                   <!-- react-text: 65 -->1967<!-- /react-text -->
                </span>
             </td>
          </tr>
          <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="66">
             <td class="Ta(start)" data-reactid="67">
                <span class="" data-reactid="68">
                   <!-- react-text: 69 -->Mr. George  Lippiatt<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(start) W(45%)" data-reactid="70">
                <span class="" data-reactid="71">
                   <!-- react-text: 72 -->CFO &amp; Group Exec. Strategy<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="73">
                <span class="" data-reactid="74">
                   <!-- react-text: 75 -->481.25k<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="76"><span class="" data-reactid="77"><span data-reactid="78">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="79"><span class="" data-reactid="80"><span data-reactid="81">N/A</span></span></td>
          </tr>
          <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="82">
             <td class="Ta(start)" data-reactid="83">
                <span class="" data-reactid="84">
                   <!-- react-text: 85 -->Mr. Ed  McKeiver<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(start) W(45%)" data-reactid="86">
                <span class="" data-reactid="87">
                   <!-- react-text: 88 -->Group Exec. of Coal<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="89">
                <span class="" data-reactid="90">
                   <!-- react-text: 91 -->861k<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="92"><span class="" data-reactid="93"><span data-reactid="94">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="95"><span class="" data-reactid="96"><span data-reactid="97">N/A</span></span></td>
          </tr>
          <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="98">
             <td class="Ta(start)" data-reactid="99">
                <span class="" data-reactid="100">
                   <!-- react-text: 101 -->Mr. Michael G. Carter BEng, BBus, MAICD, AFAIM<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(start) W(45%)" data-reactid="102">
                <span class="" data-reactid="103">
                   <!-- react-text: 104 -->Group Exec. of Technical Services &amp; Planning<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="105"><span class="" data-reactid="106"><span data-reactid="107">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="108"><span class="" data-reactid="109"><span data-reactid="110">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="111"><span class="" data-reactid="112"><span data-reactid="113">N/A</span></span></td>
          </tr>
          <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="114">
             <td class="Ta(start)" data-reactid="115">
                <span class="" data-reactid="116">
                   <!-- react-text: 117 -->Mr. Christopher L. Vagg<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(start) W(45%)" data-reactid="118">
                <span class="" data-reactid="119">
                   <!-- react-text: 120 -->Head of Investor Relations &amp; Group Treasurer<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="121"><span class="" data-reactid="122"><span data-reactid="123">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="124"><span class="" data-reactid="125"><span data-reactid="126">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="127"><span class="" data-reactid="128"><span data-reactid="129">N/A</span></span></td>
          </tr>
          <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="130">
             <td class="Ta(start)" data-reactid="131">
                <span class="" data-reactid="132">
                   <!-- react-text: 133 -->Mark  Hairsine<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(start) W(45%)" data-reactid="134">
                <span class="" data-reactid="135">
                   <!-- react-text: 136 -->Mang. of Media &amp; Communications<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="137"><span class="" data-reactid="138"><span data-reactid="139">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="140"><span class="" data-reactid="141"><span data-reactid="142">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="143"><span class="" data-reactid="144"><span data-reactid="145">N/A</span></span></td>
          </tr>
          <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="146">
             <td class="Ta(start)" data-reactid="147">
                <span class="" data-reactid="148">
                   <!-- react-text: 149 -->Ms. Tina  Thomas<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(start) W(45%)" data-reactid="150">
                <span class="" data-reactid="151">
                   <!-- react-text: 152 -->Group Exec. of Corp.<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="153"><span class="" data-reactid="154"><span data-reactid="155">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="156"><span class="" data-reactid="157"><span data-reactid="158">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="159"><span class="" data-reactid="160"><span data-reactid="161">N/A</span></span></td>
          </tr>
          <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="162">
             <td class="Ta(start)" data-reactid="163">
                <span class="" data-reactid="164">
                   <!-- react-text: 165 -->James  Coe<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(start) W(45%)" data-reactid="166">
                <span class="" data-reactid="167">
                   <!-- react-text: 168 -->Mang. of Market Intelligence &amp; Sustainability<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="169"><span class="" data-reactid="170"><span data-reactid="171">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="172"><span class="" data-reactid="173"><span data-reactid="174">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="175"><span class="" data-reactid="176"><span data-reactid="177">N/A</span></span></td>
          </tr>
          <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="178">
             <td class="Ta(start)" data-reactid="179">
                <span class="" data-reactid="180">
                   <!-- react-text: 181 -->Mr. Dominic Dupont Smith<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(start) W(45%)" data-reactid="182">
                <span class="" data-reactid="183">
                   <!-- react-text: 184 -->Company Sec.<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="185"><span class="" data-reactid="186"><span data-reactid="187">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="188"><span class="" data-reactid="189"><span data-reactid="190">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="191">
                <span class="" data-reactid="192">
                   <!-- react-text: 193 -->1964<!-- /react-text -->
                </span>
             </td>
          </tr>
          <tr class="C($primaryColor) BdB Bdc($seperatorColor) H(36px)" data-reactid="194">
             <td class="Ta(start)" data-reactid="195">
                <span class="" data-reactid="196">
                   <!-- react-text: 197 -->Lowana  Riddiford<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(start) W(45%)" data-reactid="198">
                <span class="" data-reactid="199">
                   <!-- react-text: 200 -->Warehouse Logistics Coordinator<!-- /react-text -->
                </span>
             </td>
             <td class="Ta(end)" data-reactid="201"><span class="" data-reactid="202"><span data-reactid="203">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="204"><span class="" data-reactid="205"><span data-reactid="206">N/A</span></span></td>
             <td class="Ta(end)" data-reactid="207"><span class="" data-reactid="208"><span data-reactid="209">N/A</span></span></td>
          </tr>
       </tbody>
    </table>
    

    The fact that there is a <span> element within each <td> should not affect IMPORTXML / IMPORTHTML as in other sites it can pick up the text content within <a> tags no problem.

    However the comments for react <!-- react-text: 193 -->1964<!-- /react-text --> might be causing the Spreadsheet functions to get confused.

    Their robots.txt file doesn't seem to contain any clues either:

    User-agent: *
    Sitemap: https://au.finance.yahoo.com/sitemaps/finance-sitemap_index_AU_en-AU.xml.gz
    Disallow: /r/
    Disallow: /__rapidworker-1.2.js
    Disallow: /__blank
    Disallow: /_td_api
    Disallow: /_remote
    

    Google's user agent is Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; http://docs.google.com)

    So I can't say for sure what about the site makes it go wrong, but it is likely a change on the Yahoo side of things. Maybe they have rate limited Google's IP addresses


    Use UrlFetchApp or better yet, an actual API

    I believe you are pushing the limits of what is possible with humble Spreadsheet functions. Which is great, don't get me wrong, however, scraping with a tool that is really not meant for heavy scraping is not very reliable. Considering that web scraping is also not 100% reliable as a source of information as it is constantly subject to the scraped sites changing their HTML structure or leaving things to be dynamically generated by JavaScript. Things will eventually break.

    With UrlFetchApp within Apps Script you have a lot more control and flexibility over how to scrape your data.

    Though if this is a serious activity for you which requires dependability, it will pay off to use an API. APISs are designed for websites and sheets and databases to fetch data from them. They welcome it. There are some free offerings about too.