I have a dataset like this:
C1 | C2 | C3 |
---|---|---|
a | e | i |
b | f | j |
c | g | k |
d | h | l |
I want to obtain this:
C1 | C2 | C3 | C4 |
---|---|---|---|
a | e | i | a b c d e f g h i j k l |
b | f | j | a b c d e f g h i j k l |
c | g | k | a b c d e f g h i j k l |
d | h | l | a b c d e f g h i j k l |
Do you know how to do this in Python/PySpark? Thanks!
Try a window function, then combine collect_set with flatten, sort_array, array_distinct, and array_join.
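For a self-contained run, the input DataFrame can be built first. A minimal sketch, assuming a live SparkSession (the construction itself isn't in the question; column names are lowercased to match the output below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question, one tuple per row.
df = spark.createDataFrame(
    [("a", "e", "i"), ("b", "f", "j"), ("c", "g", "k"), ("d", "h", "l")],
    ["c1", "c2", "c3"],
)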
Example:
from pyspark.sql.functions import (col, lit, collect_set, array, flatten,
                                   sort_array, array_distinct, array_join)
from pyspark.sql.window import Window

# One window covering every row, so each collect_set gathers the whole column.
w = Window.partitionBy(lit(1)).rowsBetween(Window.unboundedPreceding,
                                           Window.unboundedFollowing)

df2 = df.withColumn("temp_c1", collect_set(col("c1")).over(w)) \
    .withColumn("temp_c2", collect_set(col("c2")).over(w)) \
    .withColumn("temp_c3", collect_set(col("c3")).over(w)) \
    .withColumn("c4", array_join(array_distinct(sort_array(flatten(
        array(col("temp_c1"), col("temp_c2"), col("temp_c3"))))), '')) \
    .drop("temp_c1", "temp_c2", "temp_c3")
df2.show(10, truncate=False)
#+---+---+---+------------+
#|c1 |c2 |c3 |c4 |
#+---+---+---+------------+
#|b |f |j |abcdefghijkl|
#|c |g |k |abcdefghijkl|
#|d |h |l |abcdefghijkl|
#|a |e |i |abcdefghijkl|
#+---+---+---+------------+
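One caveat: the question's expected C4 is space-separated ("a b c d e f g h i j k l"), while the snippet above joins with an empty delimiter. Passing ' ' to array_join produces the question's format; a compact sketch of the same pipeline (no temp columns needed):

from pyspark.sql.functions import (array, array_distinct, array_join,
                                   collect_set, flatten, lit, sort_array)
from pyspark.sql.window import Window

w = Window.partitionBy(lit(1)).rowsBetween(Window.unboundedPreceding,
                                           Window.unboundedFollowing)

# Same logic as above, built in one expression and joined with a space.
df_spaced = df.withColumn("c4", array_join(array_distinct(sort_array(flatten(
    array(collect_set("c1").over(w),
          collect_set("c2").over(w),
          collect_set("c3").over(w))))), ' '))
df_spaced.show(truncate=False)  # c4 = "a b c d e f g h i j k l"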