Search code examples
regexlooker-studiore2

Extracting domain without subdomains other then www with RE2 / regex?


This regular expression extracts domain with TLD from urls:

TRIM(REGEXP_EXTRACT(REGEXP_REPLACE(REGEXP_REPLACE(Address, "https?://", ""), R"^(w{3}\.)?", ""), "([^/?]+)"))

This expression makes from http://www.example.com/page.htmlexample.com, which is expected output.

But it works only in case if subdomain is www. In other cases it doesn't process.

If i use instead TRIM(REGEXP_EXTRACT(REGEXP_REPLACE(REGEXP_REPLACE(Address, "https?://", ""), R"^([a-zA-Z0-9]{1,}\.)?", ""), "([^/?]+)")) - the whole domain name disappears and only TLD remains.

Is there any way to process any subdomain, which string can be expressed with [a-zA-Z0-9]{1,}?

Expected output would be

+-------------------+-------------+
| sites.example.com | example.com |
+-------------------+-------------+
| m.example.com     | example.com |
+-------------------+-------------+
| ww2.example.com   | example.com |
+-------------------+-------------+
| mail.example.com  | example.com |
+-------------------+-------------+
| blog.example.com  | example.com |
+-------------------+-------------+
| shop.example.com  | example.com |
+-------------------+-------------+
| cdn.example.com   | example.com |
+-------------------+-------------+
| api.example.com   | example.com |
+-------------------+-------------+
| 1.example.com     | example.com |
+-------------------+-------------+

Solution

  • Where the field is Address, the below does the trick:

    REGEXP_EXTRACT(
     REGEXP_EXTRACT(
      REGEXP_REPLACE(Address, "^(.*//)", ""),
     "^([^/]*)"),
    R"([^\.]*\.[^\.]*)$")
    

    A quick breakdown of Line 3 to Line 5 using an unnested explanation of the Calculated Field:

    • Line 3 REGEXP_REPLACE(Address, "^(.*//)", ""): Removes the http// or https// section
    • Line 4: REGEXP_EXTRACT(Line 3, "^([^/]*)"): Extracts all characters till the first /
    • Line 5: REGEXP_EXTRACT(Line 4, R"([^\.]*\.[^\.]*)$"): Extracts all characters from the end of the string, till the second .

    Editable Google Data Studio Report (Embedded Google Sheets Data Source) and a GIF to elaborate: