Tags: regex, google-apps-script, google-docs-api

Get path element from URL using findText()


Say you have the following paragraph in a Google Doc and you want to pull out the path element of the URL that relates to a car.

Some paragraph with some data in it has a url http://example.com/ford/some/other/data.html. There is also another link: http://example.com/ford/latest.html.

I want to pull "ford" out of this paragraph so I can use it. For the sake of simplicity, assume I know the paragraph's index; I will just use 1 below.

I have tried:

function getData() {
  var paragraphs = DocumentApp.getActiveDocument().getBody().getParagraphs();
  var element = paragraphs[1];
  var re = element.findText('http://example.com/([a-z])+/');
  var data = re.getElement().asText().getText();
  Logger.log(data);
}

The problem is that data contains the entire paragraph text.

Also, is there a way to capture and use the data from a capturing group, i.e. the content matched inside the ()?


Solution

  • I believe your goal is as follows.

    • You want to retrieve the value of ford from the values like http://example.com/ford/latest.html and http://example.com/ford/some/other/data.html using Google Apps Script.
    • Those values are in a Google Document.

    For this, how about this modification?

    Modification points:

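    In your script, when element.findText('http://example.com/([a-z])+/') returns a value, re.getElement().asText().getText() is the text of the whole paragraph. This is because findText() returns a RangeElement, and getElement() gives the text element that contains the match, not the matched substring itself. It also means that the text matching the pattern is included in that paragraph text. Using this, how about retrieving values like ford from re.getElement().asText().getText() with a JavaScript regular expression?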
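As a side note to the point above, the RangeElement returned by findText() also carries the offsets of the match, so the matched substring can be cut out of the paragraph text. The following is a minimal sketch, not the answer's method: sliceMatch and findMatchedText are hypothetical helper names, and element is assumed to be a paragraph obtained from DocumentApp.

```javascript
// Pure helper: cut the matched substring out of the element's full text.
// The end offset reported by RangeElement is inclusive, hence the +1.
function sliceMatch(fullText, startOffset, endOffsetInclusive) {
  return fullText.substring(startOffset, endOffsetInclusive + 1);
}

// Hypothetical wrapper around findText(). `element` is assumed to be a
// Paragraph (or another element that supports findText) from DocumentApp.
function findMatchedText(element, pattern) {
  var range = element.findText(pattern);
  if (!range) return null; // findText() returns null when nothing matches

  var fullText = range.getElement().asText().getText();
  // When the range is not partial, the match covers the whole text element
  // and the offsets are not meaningful, so return the full text.
  if (!range.isPartial()) return fullText;

  return sliceMatch(fullText, range.getStartOffset(), range.getEndOffsetInclusive());
}
```

This only yields the whole matched URL prefix, not the capture group. The segment itself still has to be extracted with a JavaScript regular expression, as in the modified script.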

    Modified script:

    From:
    var data = re.getElement().asText().getText();
    Logger.log(data);
    
    To:
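    if (re) {
      // findText() does not return capture groups, so apply a JavaScript regex to the paragraph text.
      var data = [...re.getElement().asText().getText().matchAll(/http:\/\/example\.com\/([^\/\s]+)\//g)];
      // Keep only the first capture group of each match.
      console.log(data.map(([, e]) => e));
    } else {
      throw new Error("No match.");
    }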
    
    • When the paragraph has no text that matches your regex, findText() returns null, so re is null. Check for this before calling re.getElement().
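The capture-group step can also be tried outside a document. Below is a minimal sketch using the sample paragraph from the question. extractFirstPathSegments and unique are hypothetical helper names, and the pattern assumes every URL of interest starts with http://example.com/.

```javascript
// Return the first path segment of every http://example.com/... URL in `text`.
function extractFirstPathSegments(text) {
  // The capture group takes everything up to the next slash or whitespace.
  const pattern = /http:\/\/example\.com\/([^\/\s]+)\//g;
  // matchAll() requires the g flag; m[1] is the first capture group.
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

// Optional: collapse repeated values while keeping first-seen order.
function unique(values) {
  return [...new Set(values)];
}

const sample =
  "Some paragraph with some data in it has a url " +
  "http://example.com/ford/some/other/data.html. " +
  "There is also another link: http://example.com/ford/latest.html.";

const segments = extractFirstPathSegments(sample);
console.log(segments);         // ["ford", "ford"]
console.log(unique(segments)); // ["ford"]
```

In Apps Script, the same function could be passed the string from re.getElement().asText().getText().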

    Note:

    • Please run this script with the V8 runtime enabled. The spread syntax, matchAll(), and arrow functions used above require it.
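If enabling V8 is not an option, the same extraction can be written with a RegExp.exec() loop, which avoids matchAll(), spread syntax, and arrow functions. This is a sketch of an alternative rather than part of the original answer, and extractSegmentsLegacy is a hypothetical name.

```javascript
// ES5-style alternative: collect the first capture group of every match.
function extractSegmentsLegacy(text) {
  var re = /http:\/\/example\.com\/([^\/\s]+)\//g;
  var results = [];
  var m;
  // With the g flag, exec() resumes from re.lastIndex on each call
  // and returns null once there are no more matches.
  while ((m = re.exec(text)) !== null) {
    results.push(m[1]);
  }
  return results;
}
```

Called on the paragraph text, it returns one entry per matching URL, in document order.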

    Reference: