Let's suppose we have a dataframe
If we observe the data. 4-1234
and 4-1235
codes are related to MUMS12345A
tan. and in reverse if we see from tanlist column. MUMS12345A
tan is associated with both the codes
CustomerCode | TanList |
---|---|
4-1234 | MUMS12345A,BLRS12345E,BLRS12345G |
4-1235 | MUMS12345A,CHED12345A |
4-1236 | RTKD12345A |
i need pyspark or sql code to get the desired output as this is the required output:
CustomerCodeList | TANList |
---|---|
4-1234, 4-1235 | MUMS12345A,BLRS12345E,BLRS12345G,CHED12345A |
4-1236 | RTKD12345A |
I have provided the inputs and expected result.
Check below code.
WITH in_cte AS (
SELECT
MAP_FILTER(
AGGREGATE(
COLLECT_LIST(MAP(customer_code,tanlist)) over(order by 1),
MAP('',''),
(acc, e) -> MAP_CONCAT(acc, e)
),
(k, v) -> v == tanlist OR ARRAYS_OVERLAP(SPLIT(v, ','), SPLIT(tanlist, ','))
) AS output
FROM VALUES
("4-1234","MUMS12345A,BLRS12345E,BLRS12345G"),
("4-1235","MUMS12345A,CHED12345A"),
("4-1236","RTKD12345A")
AS (customer_code, tanlist)
)
SELECT
DISTINCT
CONCAT_WS(',',MAP_KEYS(output)) AS customer_code,
CONCAT_WS(',', MAP_VALUES(output)) AS tanlist
FROM in_cte
+-------------+------------------------------------------------------+
|customer_code|tanlist |
+-------------+------------------------------------------------------+
|4-1234,4-1235|MUMS12345A,BLRS12345E,BLRS12345G,MUMS12345A,CHED12345A|
|4-1236 |RTKD12345A |
+-------------+------------------------------------------------------+