Search code examples
sqlsnowflake-cloud-data-platformunique

SQL - Identify unique person in a customer table with several customer ids, phone numbers and email


I have a customer table with a unique customers id, a customer phone number and a customer email. It could happen, that a unique person can have several customer id's, but the same email and/or the same phone number. Just like this:

| customer id  | customer email  | customer phone number  | person id  |
|:-------------|:----------------|:-----------------------|:-----------|
| 1            |[email protected]  | 0111 111111            | 1          |
| 2            |[email protected]  | 0999 999999            | 1          |
| 3            |[email protected]   | 0111 111111            | 1          | 
| 4            |[email protected]   | 0555 555555            | 1          |
| 5            |[email protected]   | 0555 555555            | 1          |
| 6            |[email protected] | 0333 333333            | 6          |
| 7            |[email protected] | 0333 333333            | 6          |

I would like to check with a SQL query in Snowflake, are there any mathes between email and phone number and create a new unique person id (e.g. customer ids 1 to 5 seem to be the same unique person because of the email and phone number and the customer id 6 and 7 are one unique person, because the phone numbers are the same). The smallest customer id of this unique person should always be taken as the person id.

How can I identify this way unique persons with a SQL query?

Thank you so much for your help & best Sophie


Solution

  • This is a bit ugly, and may be improved with a window function or recursive CTE, but this is quick and dirty and generates the results you posted. It requires two passes over the data because your result set shows a transitive relationship on the data.

    -- Test table
        insert into
            customer (id, email, phone, person_id)
        values
            (1, '[email protected]', '0111 111111', null),(2, '[email protected]', '0999 999999', null),
            (3, '[email protected]', '0111 111111', null),(4, '[email protected]', '0555 555555', null),
            (5, '[email protected]', '0555 555555', null),(6, '[email protected]', '0333 333333', null),(7, '[email protected]', '0333 333333', null);
     
    -- CTE to reflect anchor query and 2 passes to produce the results
       
        with cte as (select
            A.id,
            A.email,
            A.phone,
            min(B.id) as min_id
        from
            customer A,
            customer B
        where
            (A.email = B.email)
            or (A.phone = B.phone)
        group by
            A.id,
            A.email,
            A.phone),
        t1 as (select email, case when min(id) < min(min_id) then min(id) else min(min_id) end min_id from cte
            group by email),
        t2 as (select phone, case when min(id) < min(min_id) then min(id) else min(min_id) end min_id from cte
            group by phone),
        t3 as     
            (select cte.id, cte.email, cte.phone, case when min(t1.min_id) < min(t2.min_id) then min(t1.min_id) else min(t2.min_id) end min_id
            from cte, t1, t2
            where (cte.email = t1.email) and (cte.phone = t2.phone)
            group by 1,2,3),
        t4 as (select email, case when min(id) < min(min_id) then min(id) else min(min_id) end min_id from t3
            group by email),
        t5 as (select phone, case when min(id) < min(min_id) then min(id) else min(min_id) end min_id from t3
            group by phone)
            
        select t3.id, t3.email, t3.phone, case when min(t4.min_id < t3.min_id) then min(t4.min_id) when min(t5.min_id < t3.min_id) then min(t5.min_id) else min(t3.min_id) end person_id
        from t3, t4, t5
        where (t4.email = t3.email) and (t5.phone = t3.phone)
        group by 1,2,3
            order by t3.id;
    
    -- Results
    ID  EMAIL               PHONE       PERSON_ID
    1   [email protected]     0111 111111 1
    2   [email protected]     0999 999999 1
    3   [email protected]      0111 111111 1
    4   [email protected]      0555 555555 1
    5   [email protected]      0555 555555 1
    6   [email protected]    0333 333333 6
    7   [email protected]    0333 333333 6