I'm using SQL in Azure databricks. My databricks version is 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12).
I have a table with 5 columns of mail id & mail id's are repeating in some columns for same row.
My sample table looks like this:
| Mail_1 | Mail_2 | Mail_3 | Mail_4 | Mail_5 |
|--------------|--------------|--------------|--------------|--------------|
|a.gmail.com | a.aol.com | a1.gmail.com | a.gmail.com | a1.gmail.com |
|b.gmail.com | b.aol.com | b1.gmail.com | b.yahoo.com | null |
I need to combine these columns as & create a column Mail with unique record.
My additional output column will look like:
| Mail
|-----------------------------------------------------------------|
|a.gmail.com,a.aol.com,a1.gmail.com |
|b.gmail.com,b.aol.com,b1.gmail.com,b.yahoo.com |
My mail_1
to mail_5
columns might have null value.
Can you please suggest how to do this?
If wanting to remove null, you can use array_except
. If you don't want the output to look like an array, then you could use concat_ws
combined with array_distinct
.
with my_data as (
select 1 as id, 'a.gmail.com' as Mail_1, 'a.aol.com' as Mail_2, 'a1.gmail.com' as Mail_3, 'a.gmail.com' as Mail_4, 'a1.gmail.com' as Mail_5 union all
select 2, 'b.gmail.com', 'b.aol.com', 'b1.gmail.com', 'b.yahoo.com', null
)
select
id,
array_except(array(mail_1, mail_2, mail_3, mail_4, mail_5), array(null)) as mail,
concat_ws(',', array_distinct(array(mail_1, mail_2, mail_3, mail_4, mail_5)), NULL) as mailed
from my_data
order by id
OUTPUT:
id mail mailed
1 ["a.gmail.com", "a.aol.com", "a1.gmail.com"] a.gmail.com,a.aol.com,a1.gmail.com
2 ["b.gmail.com", "b.aol.com", "b1.gmail.com", "b.yahoo.com"] b.gmail.com,b.aol.com,b1.gmail.com,b.yahoo.com