how to groupby values based on matching values between 2columns using pyspark or sql

Let's suppose we have a dataframe If we observe the data. 4-1234 and 4-1235 codes are related to MUMS12345A tan. and in reverse if we see from tanlist column. MUMS12345A tan is associated with both the codes

CustomerCode	TanList
4-1234	MUMS12345A,BLRS12345E,BLRS12345G
4-1235	MUMS12345A,CHED12345A
4-1236	RTKD12345A

i need pyspark or sql code to get the desired output as this is the required output:

CustomerCodeList	TANList
4-1234, 4-1235	MUMS12345A,BLRS12345E,BLRS12345G,CHED12345A
4-1236	RTKD12345A

I have provided the inputs and expected result.

Solution

Check below code.

WITH in_cte AS (
    SELECT        
        MAP_FILTER(
            AGGREGATE(
                COLLECT_LIST(MAP(customer_code,tanlist)) over(order by 1), 
                MAP('',''), 
                (acc, e) -> MAP_CONCAT(acc, e) 
            ), 
            (k, v) -> v == tanlist OR ARRAYS_OVERLAP(SPLIT(v, ','), SPLIT(tanlist, ','))
        ) AS output
    FROM VALUES 
        ("4-1234","MUMS12345A,BLRS12345E,BLRS12345G"),
        ("4-1235","MUMS12345A,CHED12345A"),
        ("4-1236","RTKD12345A") 
    AS (customer_code, tanlist)
)
SELECT 
    DISTINCT 
    CONCAT_WS(',',MAP_KEYS(output)) AS customer_code, 
    CONCAT_WS(',', MAP_VALUES(output)) AS tanlist 
FROM in_cte

+-------------+------------------------------------------------------+
|customer_code|tanlist                                               |
+-------------+------------------------------------------------------+
|4-1234,4-1235|MUMS12345A,BLRS12345E,BLRS12345G,MUMS12345A,CHED12345A|
|4-1236       |RTKD12345A                                            |
+-------------+------------------------------------------------------+