Search code examples
javascriptgoogle-apps-scriptgoogle-sheetscustom-function

How can I customize the existing script to import multiple values of identical regex paths


I use the script from this post to import the value via a regex path To exceed the ImportXML limit on Google Spreadsheet

function importRegex(url, regexInput) {
  var output = '';
  var fetchedUrl = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
  if (fetchedUrl) {
    var html = fetchedUrl.getContentText();
    if (html.length && regexInput.length) {
      output = html.match(new RegExp(regexInput, 'i'))[1];
    }
  }
  // Grace period to not overload
  Utilities.sleep(1000);
  return unescapeHTML(output);
}

var htmlEntities = {
  nbsp:  ' ',
  cent:  '¢',
  pound: '£',
  yen:   '¥',
  euro:  '€',
  copy:  '©',
  reg:   '®',
  lt:    '<',
  gt:    '>',
  mdash: '–',
  ndash: '-',
  quot:  '"',
  amp:   '&',
  apos:  '\''
};

function unescapeHTML(str) {
    return str.replace(/\&([^;]+);/g, function (entity, entityCode) {
        var match;

        if (entityCode in htmlEntities) {
            return htmlEntities[entityCode];
        } else if (match = entityCode.match(/^#x([\da-fA-F]+)$/)) {
            return String.fromCharCode(parseInt(match[1], 16));
        } else if (match = entityCode.match(/^#(\d+)$/)) {
            return String.fromCharCode(~~match[1]);
        } else {
            return entity;
        }
    });
}; 

The problem is that this way only one value is imported, e.g. only one value from H2, even if there are several H2 tags on the page. How do I adjust the code to output all values that match the regex path?


Solution

  • Try this:

    function importRegex(url, regexInput) {
      var output = [];
      var fetchedUrl = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
      if (fetchedUrl) {
        var html = fetchedUrl.getContentText();
        if (html.length && regexInput.length) {
          var results = html.matchAll(new RegExp(regexInput, 'ig'));
          for (result of results){
           output.push(result[1])
          }
        }
      }
      // Grace period to not overload
      Utilities.sleep(1000);
      output = output.map(x => unescapeHTML(x));
      return output;
    }
    

    Note that the required changes are only to the importRegex() function but you still need to keep htmlEntities and unescapeHTML() as they were.

    Explanation:

    The original function uses match() which returns only one result from the regex. If you want more results then you can use matchAll() and loop through the iterator to build the output array.

      var results = html.matchAll(new RegExp(regexInput, 'ig'));
        for (result of results){
          output.push(result[1])
        }
    

    After that to run every element through the unescapeHTML function you can use map():

      output = output.map(x => unescapeHTML(x));
    

    As a test we can try the following call to get the h2 headers from the Wikipedia Soap page:

    =IMPORTREGEX("https://en.wikipedia.org/wiki/Soap", "<h2><(?:.*?)>(.*?)<\/span>")
    

    This will expand the array over however many rows are needed:

    enter image description here

    The tricky part is getting the regex right, but this will depend on the site that you want to scrape.