I am attempting to use the IMPORTXML function in Google Sheets to extract elements from a specific Google Support page. The formula works fine with other URLs, but when I use it on a Google Support page, it returns a "Could not fetch url" error:
=IMPORTXML("https://support.google.com/looker-studio/answer/11521624?hl=en", "//h2")
Is this a specific limitation with IMPORTXML for Google's own webpages? Are there any known workarounds or specific adjustments to the IMPORTXML query that might bypass this issue, while still using IMPORTXML?
There are limitations with IMPORTXML. IMPORTXML can only import data from structured data types, including XML, HTML, CSV, TSV, and RSS and ATOM XML feeds.
It appears the data you are trying to import is either dynamically loaded by JavaScript or malformed, and is therefore not retrievable by IMPORTXML.
Update From Comments:
There is almost always an alternative if you are open to scripts, and there are plenty of examples out there for fetching and web scraping. For example, you can use the following script to extract all the content between <h2> tags at the specified URL and return it as an array.
Just navigate to Extensions > Apps Script, paste this into the Code.gs file, and click save.
function fetchAllH2Text(url) {
  // Fetch the raw HTML of the page
  const html = UrlFetchApp.fetch(url).getContentText();
  // Use a regular expression to match content between <h2> tags
  const h2Matches = html.match(/<h2[^>]*>([\s\S]*?)<\/h2>/g);
  if (h2Matches) {
    // Strip the <h2> tags themselves from each match
    const h2TextArray = h2Matches.map(match => match.replace(/<\/?h2[^>]*>/g, '').trim());
    return h2TextArray;
  } else {
    return ["No <h2> tags found"];
  }
}
This will allow you to use a custom function like this:
=fetchAllH2Text("https://support.google.com/looker-studio/answer/11521624?hl=en")
Because the URL does not serve well-formed XML (the reason IMPORTXML did not work in the first place), we use a regular expression to extract the text between the <h2> tags. Some of that content may still be wrapped in an inner tag within the <h2>. From there you can use a standard formula to clean up what you're looking for.
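If you would rather do that cleanup in the script instead of a formula, a small helper along these lines works too. This is just a sketch, and stripInnerTags is a name I'm introducing here, not part of the script above:

```javascript
// Hypothetical helper: removes any leftover HTML tags (e.g. an inner <a>)
// from a matched string, so the sheet receives plain text.
function stripInnerTags(text) {
  return String(text).replace(/<[^>]*>/g, '').trim();
}
```

You could then return h2TextArray.map(stripInnerTags) from fetchAllH2Text instead of the raw matches.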
Update 2:
Alternatively, if you want to clean up the data via the regular expression itself, you can use this script, which only returns matches that contain long-form dates.
function fetchDatesFromH2Tags(url) {
  // Fetch the raw HTML of the page
  const html = UrlFetchApp.fetch(url).getContentText();
  // Match <h2> tags whose content is a long-form date (e.g. "January 5, 2024"),
  // optionally preceded by an <a> anchor tag
  const h2Matches = html.match(/<h2[^>]*>(?:<a[^>]*>.*?<\/a>)?(?:January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4}<\/h2>/gi);
  if (h2Matches) {
    // Strip the <h2> and <a> tags, leaving just the date text
    const h2TextArray = h2Matches.map(match => match.replace(/<\/?h2[^>]*>|<a[^>]*>|<\/a>/g, '').trim());
    return h2TextArray;
  } else {
    return ["No <h2> tags found"];
  }
}
Be sure to use the new function name: =fetchDatesFromH2Tags("https://support.google.com/looker-studio/answer/11521624?hl=en")
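As a quick sanity check of what the regular expression extracts, you can exercise it in plain JavaScript outside Apps Script. The sample HTML snippet below is hand-made for illustration, not the actual page markup:

```javascript
// Hand-made sample snippet standing in for the fetched page HTML.
const sample = '<h2><a name="d1"></a>January 5, 2024</h2><h2>Other heading</h2>';
// Same regex as fetchDatesFromH2Tags: an <h2>, an optional <a> anchor, then a long-form date.
const h2Matches = sample.match(/<h2[^>]*>(?:<a[^>]*>.*?<\/a>)?(?:January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4}<\/h2>/gi) || [];
// Same cleanup step: strip the <h2> and <a> tags, leaving just the date text.
const dates = h2Matches.map(m => m.replace(/<\/?h2[^>]*>|<a[^>]*>|<\/a>/g, '').trim());
// Only the first <h2> matches, because the second contains no date.
```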