google-sheets, web-scraping, xml-parsing, google-sheets-formula

Trouble Extracting Elements from Google Support Pages Using IMPORTXML in Google Sheets


I am attempting to use the IMPORTXML function in Google Sheets to extract elements from a specific Google Support page. The formula works fine with other URLs, but when I use it on a Google Support page, it returns a "Could not fetch url" error:

=IMPORTXML("https://support.google.com/looker-studio/answer/11521624?hl=en", "//h2")

  • The headings definitely exist on the page.
  • The page does not require a login and is publicly accessible.
  • The IMPORTXML function works successfully with other non-Google websites.
  • I would prefer to solve this using IMPORTXML itself and not other methods like Google Apps Script or manual copying.

Is this a specific limitation with IMPORTXML for Google's own webpages? Are there any known workarounds or specific adjustments to the IMPORTXML query that might bypass this issue, while still using IMPORTXML?


Solution

  • There are limitations with IMPORTXML.

    IMPORTXML can only import data from various structured data types, including XML, HTML, CSV, TSV, and RSS and Atom XML feeds.

    It appears the data you are trying to import is dynamically loaded by JavaScript or malformed, so it is not retrievable by IMPORTXML.
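
    You can confirm the problem is the page rather than your XPath by pointing the same query at a static, well-formed page. The Wikipedia URL below is only an illustrative example:

    =IMPORTXML("https://en.wikipedia.org/wiki/XML", "//h2")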

    Update From Comments:

    There is usually an alternative if you are open to scripts, and there are a lot of examples out there for fetching and web scraping. For example, you can use the following script to extract all the content between <h2> tags at the specified URL and return it as an array.

    Just navigate to Extensions > Apps Script, paste this into the Code.gs file, and click Save.

    function fetchAllH2Text(url) {
      const html = UrlFetchApp.fetch(url).getContentText();
    
      // Use regular expression to match content between <h2> tags
      const h2Matches = html.match(/<h2[^>]*>([\s\S]*?)<\/h2>/g);
    
      if (h2Matches) {
        // Extract text content from the matches
        const h2TextArray = h2Matches.map(match => match.replace(/<\/?h2[^>]*>/g, '').trim());
        return h2TextArray;
      } else {
        return ["No <h2> tags found"];
      }
    }
    

    This will allow you to use a custom function like this:

    =fetchAllH2Text("https://support.google.com/looker-studio/answer/11521624?hl=en")

    Because the URL does not return well-formed XML (which is why IMPORTXML failed in the first place), we use a regular expression to extract the text between <h2> tags. Some of that content is still wrapped in an <a> tag inside the <h2>. From there you can use a standard formula to clean up what you're looking for, as sketched after the example below.

    [example screenshot]
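
    As a minimal sketch of that cleanup step (assuming the custom function's output spills into A1:A20; adjust the range to match your sheet), REGEXREPLACE can strip any leftover tags:

    =ARRAYFORMULA(TRIM(REGEXREPLACE(A1:A20, "<[^>]+>", "")))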

    Update 2:

    Or, if you want to clean up the data via the regular expression itself, you can use this script, which only returns long-format dates (for example, "January 5, 2024") from the matches.

    function fetchDatesFromH2Tags(url) {
      const html = UrlFetchApp.fetch(url).getContentText();
    
      // Use regular expression to match content between <h2> tags
      const h2Matches = html.match(/<h2[^>]*>(?:<a[^>]*>.*?<\/a>)?(?:January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4}<\/h2>/gi);
    
      if (h2Matches) {
        // Extract text content from the matches
        const h2TextArray = h2Matches.map(match => match.replace(/<\/?h2[^>]*>|<a[^>]*>|<\/a>/g, '').trim());
        return h2TextArray;
      } else {
        return ["No <h2> tags found"];
      }
    }
    

    Be sure to use the new function name:

    =fetchDatesFromH2Tags("https://support.google.com/looker-studio/answer/11521624?hl=en")
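
    If you want to inspect the output before wiring it into a cell, a simple test wrapper (a hypothetical helper, not part of the script above) can log the result to the Apps Script execution log:

    function testFetchDates() {
      // Hypothetical helper: runs the custom function and logs the
      // returned array so it can be inspected in the execution log.
      const dates = fetchDatesFromH2Tags("https://support.google.com/looker-studio/answer/11521624?hl=en");
      Logger.log(dates);
    }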