Search code examples
apache-sparkpysparkapache-spark-sql

how to groupby values based on matching values between 2columns using pyspark or sql


Let's suppose we have a dataframe If we observe the data. 4-1234 and 4-1235 codes are related to MUMS12345A tan. and in reverse if we see from tanlist column. MUMS12345A tan is associated with both the codes

CustomerCode TanList
4-1234 MUMS12345A,BLRS12345E,BLRS12345G
4-1235 MUMS12345A,CHED12345A
4-1236 RTKD12345A

i need pyspark or sql code to get the desired output as this is the required output:

CustomerCodeList TANList
4-1234, 4-1235 MUMS12345A,BLRS12345E,BLRS12345G,CHED12345A
4-1236 RTKD12345A

I have provided the inputs and expected result.


Solution

  • Check below code.

    WITH in_cte AS (
        SELECT        
            MAP_FILTER(
                AGGREGATE(
                    COLLECT_LIST(MAP(customer_code,tanlist)) over(order by 1), 
                    MAP('',''), 
                    (acc, e) -> MAP_CONCAT(acc, e) 
                ), 
                (k, v) -> v == tanlist OR ARRAYS_OVERLAP(SPLIT(v, ','), SPLIT(tanlist, ','))
            ) AS output
        FROM VALUES 
            ("4-1234","MUMS12345A,BLRS12345E,BLRS12345G"),
            ("4-1235","MUMS12345A,CHED12345A"),
            ("4-1236","RTKD12345A") 
        AS (customer_code, tanlist)
    )
    SELECT 
        DISTINCT 
        CONCAT_WS(',',MAP_KEYS(output)) AS customer_code, 
        CONCAT_WS(',', MAP_VALUES(output)) AS tanlist 
    FROM in_cte
    
    +-------------+------------------------------------------------------+
    |customer_code|tanlist                                               |
    +-------------+------------------------------------------------------+
    |4-1234,4-1235|MUMS12345A,BLRS12345E,BLRS12345G,MUMS12345A,CHED12345A|
    |4-1236       |RTKD12345A                                            |
    +-------------+------------------------------------------------------+