Search code examples
sqlazure-databricks

How to combine distinct records of multiple columns in SQL


I'm using SQL in Azure databricks. My databricks version is 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12).

I have a table with 5 columns of mail id & mail id's are repeating in some columns for same row.

My sample table looks like this:

|    Mail_1    |    Mail_2    |    Mail_3    |    Mail_4    |    Mail_5    |
|--------------|--------------|--------------|--------------|--------------|
|a.gmail.com   | a.aol.com    | a1.gmail.com | a.gmail.com  | a1.gmail.com |
|b.gmail.com   | b.aol.com    | b1.gmail.com | b.yahoo.com  | null         |

I need to combine these columns as & create a column Mail with unique record.

My additional output column will look like:

|                              Mail                               
|-----------------------------------------------------------------|
|a.gmail.com,a.aol.com,a1.gmail.com                               |
|b.gmail.com,b.aol.com,b1.gmail.com,b.yahoo.com                   |

My mail_1 to mail_5 columns might have null value.

Can you please suggest how to do this?


Solution

  • If wanting to remove null, you can use array_except. If you don't want the output to look like an array, then you could use concat_ws combined with array_distinct.

    with my_data as (
      select 1 as id, 'a.gmail.com' as Mail_1, 'a.aol.com' as Mail_2, 'a1.gmail.com' as Mail_3, 'a.gmail.com' as Mail_4, 'a1.gmail.com' as Mail_5 union all
      select 2, 'b.gmail.com', 'b.aol.com', 'b1.gmail.com', 'b.yahoo.com', null
     )
    select 
      id, 
      array_except(array(mail_1, mail_2, mail_3, mail_4, mail_5), array(null)) as mail, 
      concat_ws(',', array_distinct(array(mail_1, mail_2, mail_3, mail_4, mail_5)), NULL) as mailed
    from my_data
    order by id
    

    OUTPUT:

    id  mail                                                             mailed
    1   ["a.gmail.com", "a.aol.com", "a1.gmail.com"]                     a.gmail.com,a.aol.com,a1.gmail.com
    2   ["b.gmail.com", "b.aol.com", "b1.gmail.com", "b.yahoo.com"]      b.gmail.com,b.aol.com,b1.gmail.com,b.yahoo.com