Search code examples
regexgoogle-cloud-platformre2google-logging

How to match URL path using Google RE2 regex


Google Cloud Platform lets you create label logs using the RE2 regex engine.

How can I create a regex that matches the path in the URL?

Examples matches:

https://example.com/awesome                  --> "awesome"
https://example.com/awesome/path             --> "awesome/path"
https://example.com/awesome/path/            --> "awesome/path"
https://example.com/awesome/path?arg1=123    --> "awesome/path"

Details:

  • The domain and protocol are constant, it can be assumed to be https://example.com here.
  • If there are multiple directories, they should be matched too, including the / in between.
  • Trailing / should NOT be matched.
  • Queries, e.g. ?arg1=123&arg2=456 should NOT be matched.
  • It can be assumed that directory names will only contain alphanumeric characters a-zA-Z0-9, dashes - and underscores _.

Note that Google RE2 is different than PCRE2.


Solution

  • So the syntax isn't 100% clear what is supported and what isn't. Assuming (NOT SUPPORTED) VIM means it is supported but not on vim, I'd start with a negative look behind for the beginning of the url that you don't care about

    (?<=https:\/\/example\.com\/)
    

    Then you want alphanumeric characters [\w\-]+ followed by non trailing / so I'd add a lookahead to verify that there are alphanumeric characters after the / with (?=\/\w+)\/

    The complete regex

    (?<=https:\/\/example\.com\/)([\w\-]+((?=\/\w+)\/|\b))+