This regular expression extracts domain with TLD from urls:
TRIM(REGEXP_EXTRACT(REGEXP_REPLACE(REGEXP_REPLACE(Address, "https?://", ""), R"^(w{3}\.)?", ""), "([^/?]+)"))
This expression makes from http://www.example.com/page.html
→ example.com
, which is expected output.
But it works only in case if subdomain is www
. In other cases it doesn't process.
If i use instead TRIM(REGEXP_EXTRACT(REGEXP_REPLACE(REGEXP_REPLACE(Address, "https?://", ""), R"^([a-zA-Z0-9]{1,}\.)?", ""), "([^/?]+)"))
- the whole domain name disappears and only TLD remains.
Is there any way to process any subdomain, which string can be expressed with [a-zA-Z0-9]{1,}
?
Expected output would be
+-------------------+-------------+
| sites.example.com | example.com |
+-------------------+-------------+
| m.example.com | example.com |
+-------------------+-------------+
| ww2.example.com | example.com |
+-------------------+-------------+
| mail.example.com | example.com |
+-------------------+-------------+
| blog.example.com | example.com |
+-------------------+-------------+
| shop.example.com | example.com |
+-------------------+-------------+
| cdn.example.com | example.com |
+-------------------+-------------+
| api.example.com | example.com |
+-------------------+-------------+
| 1.example.com | example.com |
+-------------------+-------------+
Where the field is Address
, the below does the trick:
REGEXP_EXTRACT(
REGEXP_EXTRACT(
REGEXP_REPLACE(Address, "^(.*//)", ""),
"^([^/]*)"),
R"([^\.]*\.[^\.]*)$")
A quick breakdown of Line 3 to Line 5 using an unnested explanation of the Calculated Field:
REGEXP_REPLACE(Address, "^(.*//)", "")
: Removes the http//
or https//
sectionREGEXP_EXTRACT(Line 3, "^([^/]*)")
: Extracts all characters till the first /
REGEXP_EXTRACT(Line 4, R"([^\.]*\.[^\.]*)$")
: Extracts all characters from the end of the string, till the second .
Editable Google Data Studio Report (Embedded Google Sheets Data Source) and a GIF to elaborate: