
How to use RegEx to extract part of a redirect URL?


I have a column containing a list of redirect URLs from Google Custom Search results. I would like to extract the external domain from each combined URL.

Example:

  1. https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q

  2. https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=http://www.exmaplesite2.co.uk/wp-content/uploads/2016/12/research-paper.pdf&sa=U&ved=2ahUKEwiphLKMi80KHcLUCMAQFjAFegQIARAC&usg=AOvVawkm-bXjmxsPxLQ9w3

  3. https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite-3.com/home/en/aaa-bbb/38376&sa=U&ved=2ahUKEwixq4K7qttXEKHTOEClsQFjAAegQIARAB&usg=AOvVaw2ouHhfNNTPV

From the URLs above, I would like to extract the external domain name.

Expected results for the examples above:

  1. examplesite1.co.uk
  2. www.exmaplesite2.co.uk
  3. examplesite-3.com

I am able to do this in Google Sheets, but I need a RegEx so that I can use it in Google Data Studio.

Thanks.


Solution

  • You may use this regex with an additional negative lookbehind:

    (?<=(?<!^https)://)[^/]+
    

    RegEx Demo

    RegEx Details:

    • (?<=(?<!^https)://): Positive lookbehind asserting that :// immediately precedes the current position. The nested negative lookbehind (?<!^https) additionally asserts that this :// is not preceded by https at the start of the string, which skips the leading Google URL.
    • [^/]+: Match one or more characters that are not /
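
    As a quick check of the lookbehind version outside Data Studio, here is a minimal sketch in Python, whose re module accepts this fixed-width nested lookbehind. The function name extract_domain_lookbehind is made up for illustration, and the URL is the first example from the question.

    ```python
    import re

    # Lookbehind pattern from the answer: match text that follows "://",
    # unless that "://" belongs to an "https" at the very start of the string.
    LOOKBEHIND_RE = re.compile(r"(?<=(?<!^https)://)[^/]+")

    def extract_domain_lookbehind(redirect_url):
        """Return the embedded domain, or None when nothing matches."""
        match = LOOKBEHIND_RE.search(redirect_url)
        return match.group(0) if match else None

    url = (
        "https://www.google.com/url?client=internal-element-cse&cx=3c360356"
        "&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U"
        "&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q"
    )
    print(extract_domain_lookbehind(url))  # examplesite1.co.uk
    ```

    Here the whole match is the domain, so no capture group is needed.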

    Update: As per the comments below, lookbehind is not supported in Google Data Studio, so we can use this regex instead:

    .https?://([^/]+)
    

    Then grab the domain name from capture group #1.

    The . placed before https? consumes one character, so the pattern cannot match a URL that begins at the very start of the string.
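
    To see the capture-group version working on all three sample URLs, here is a minimal sketch in Python rather than Data Studio itself. The helper name extract_domain is hypothetical; the pattern and URLs are taken from the post. In Data Studio the same pattern would presumably be passed to REGEXP_EXTRACT with your own field name.

    ```python
    import re

    # Capture-group pattern from the answer: any one character, then the
    # scheme, then capture everything up to the next slash.
    DOMAIN_RE = re.compile(r".https?://([^/]+)")

    def extract_domain(redirect_url):
        """Return capture group 1 (the embedded domain), or None if absent."""
        match = DOMAIN_RE.search(redirect_url)
        return match.group(1) if match else None

    urls = [
        "https://www.google.com/url?client=internal-element-cse&cx=3c360356"
        "&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U"
        "&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q",
        "https://www.google.com/url?client=internal-element-cse&cx=3c360356"
        "&q=http://www.exmaplesite2.co.uk/wp-content/uploads/2016/12/research-paper.pdf"
        "&sa=U&ved=2ahUKEwiphLKMi80KHcLUCMAQFjAFegQIARAC&usg=AOvVawkm-bXjmxsPxLQ9w3",
        "https://www.google.com/url?client=internal-element-cse&cx=3c360356"
        "&q=https://examplesite-3.com/home/en/aaa-bbb/38376&sa=U"
        "&ved=2ahUKEwixq4K7qttXEKHTOEClsQFjAAegQIARAB&usg=AOvVaw2ouHhfNNTPV",
    ]

    domains = [extract_domain(u) for u in urls]
    for domain in domains:
        print(domain)
    ```

    One caveat with [^/]+: if the embedded URL had no path after the domain, the match would run on into the following &sa=... parameters. Using [^/&]+ instead is one possible tweak for that case.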