
Concatenate all distinct values of several columns into one column in PySpark


I have a dataset like this:

C1 C2 C3
a e i
b f j
c g k
d h l

I want to obtain this:

C1 C2 C3 C4
a e i a b c d e f g h i j k l
b f j a b c d e f g h i j k l
c g k a b c d e f g h i j k l
d h l a b c d e f g h i j k l

Do you know how to do this in PySpark (Python)? Thanks!
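
For reference, a minimal sketch to reproduce this input (assuming an active SparkSession named spark; lowercase column names are used here to match the solution's output below):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", "e", "i"), ("b", "f", "j"), ("c", "g", "k"), ("d", "h", "l")],
        ["c1", "c2", "c3"],
    )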


Solution

  • Try a window function with collect_set, then combine the per-column arrays using flatten, sort_array, array_distinct, and array_join.

    Example:

    from pyspark.sql import Window
    from pyspark.sql.functions import (
        array, array_distinct, array_join, col, collect_set, flatten, lit, sort_array
    )

    # One window spanning the whole DataFrame, so every row sees all values.
    w = Window.partitionBy(lit(1)).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

    df2 = (df.withColumn("temp_c1", collect_set(col("c1")).over(w))
             .withColumn("temp_c2", collect_set(col("c2")).over(w))
             .withColumn("temp_c3", collect_set(col("c3")).over(w))
             # Merge the three arrays, sort, de-duplicate, and join into one string.
             # Pass ' ' instead of '' to array_join for the space-separated values shown in the question.
             .withColumn("c4", array_join(array_distinct(sort_array(flatten(
                 array(col("temp_c1"), col("temp_c2"), col("temp_c3"))))), ''))
             .drop("temp_c1", "temp_c2", "temp_c3"))

    df2.show(10, truncate=False)
    #+---+---+---+------------+
    #|c1 |c2 |c3 |c4          |
    #+---+---+---+------------+
    #|b  |f  |j  |abcdefghijkl|
    #|c  |g  |k  |abcdefghijkl|
    #|d  |h  |l  |abcdefghijkl|
    #|a  |e  |i  |abcdefghijkl|
    #+---+---+---+------------+
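
  • Note that partitionBy(lit(1)) puts every row in a single window partition, which can become a bottleneck on large data. A rough equivalent, sketched here under the same column-name assumptions (this variant is not from the original answer), aggregates the distinct values once and cross-joins the one-row result back onto df:

    from pyspark.sql.functions import (
        array, array_distinct, array_join, collect_set, flatten, sort_array
    )

    # Build the combined string once in a single-row aggregate ...
    agg = df.agg(
        array_join(
            array_distinct(sort_array(flatten(array(
                collect_set("c1"), collect_set("c2"), collect_set("c3")
            )))),
            ''
        ).alias("c4")
    )
    # ... then attach it to every row; a one-row cross join is cheap.
    df3 = df.crossJoin(agg)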