Search code examples
regexpostgresqlurl

Extract domain from url using PostgreSQL


I need to extract the domain name for a list of urls using PostgreSQL. In the first version, I tried using REGEXP_REPLACE to replace unwanted characters like www., biz., sports., etc. to get the domain name.

 SELECT REGEXP_REPLACE(url, ^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain", 
 COUNT(DISTINCT(user)) AS "Unique Users"
 FROM db
 GROUP BY 1
 ORDER BY 2 DESC;

This seems unfavorable as the query needs to be constantly updated for list of unwanted words.

I did try https://stackoverflow.com/a/21174423/10174021 to extract from the end of the line using PostgreSQL REGEXP_SUBSTR but, I'm getting blank rows in return. Is there a more better way of doing this?

A dataset sample to try with:

 CREATE TABLE sample (
 url VARCHAR(100) NOT NULL);

 INSERT INTO sample url) 
 VALUES 
 ("sample.co.uk"),
 ("www.sample.co.uk"),
 ("www3.sample.co.uk"),
 ("biz.sample.co.uk"),
 ("digital.testing.sam.co"),
 ("sam.co"),
 ("m.sam.co");

Desired output

+------------------------+--------------+
|    url                 |  domain      |
+------------------------+--------------+
| sample.co.uk           | sample.co.uk |
| www.sample.co.uk       | sample.co.uk |
| www3.sample.co.uk      | sample.co.uk |
| biz.sample.co.uk       | sample.co.uk |
| digital.testing.sam.co | sam.co       |
| sam.co                 | sam.co       |
| m.sam.co               | sam.co       |
+------------------------+--------------+

Solution

  • So, I've found the solution using Jeremy and Rémy Baron's answer.

    1. Extract all the public suffix from public suffix and store into a table which I labelled as tlds.

    2. Get the unique urls in the dataset and match to its TLD. part1

    3. Extract the domain name using regexp_replace (used in this query) or alternative regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld"). The final output: final_output

    The SQL query is as below:

    WITH stored_tld AS(
    SELECT 
    DISTINCT(s.url),
    FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
                                rows between unbounded preceding and unbounded following) AS "tld" 
    FROM sample s 
    JOIN tlds t 
    ON (s.url like '%%'||domain))
    
    SELECT 
    t1.url,
    CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2') 
    END AS "extracted_domain" 
    FROM(
        SELECT a.url,st."tld"
        FROM sample a
        LEFT JOIN stored_tld st
        ON a.url = st.url
        )t1
    

    Links to try: SQL Tester