I have a dataset, for example
Column A | Column B | Column C | Column D |
---|---|---|---|
Cell A1 | Cell B1 | Cell C1 | Cell D1 |
Cell A2 | Cell B2 | Cell C2 | Cell D2 |
Cell A3 | Cell B3 | Cell C3 | Cell D3 |
Cell A4 | Cell B4 | Cell C4 | Cell D4 |
Is it possible to join n rows together, for example joining row 1 and row 2 while maintaining the columns? So I would get:
Column A | Column B | Column C | Column D |
---|---|---|---|
Cell A1, A2 | Cell B1, B2 | Cell C1, C2 | Cell D1, D2 |
Cell A3, A4 | Cell B3, B4 | Cell C3, C4 | Cell D3, D4 |
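
For reference, assuming the data lives in a PySpark DataFrame (as in the answer below), the sample table could be set up with a minimal sketch like this (the SparkSession name `spark` is an assumption):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table in the question
df = spark.createDataFrame(
    [
        ("Cell A1", "Cell B1", "Cell C1", "Cell D1"),
        ("Cell A2", "Cell B2", "Cell C2", "Cell D2"),
        ("Cell A3", "Cell B3", "Cell C3", "Cell D3"),
        ("Cell A4", "Cell B4", "Cell C4", "Cell D4"),
    ],
    ["Column A", "Column B", "Column C", "Column D"],
)
```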
You can try row_number() and monotonically_increasing_id(): flag the rows by taking row_number() % 2, then use a running sum window over that flag so that each consecutive pair of rows ends up with the same group id.
Example:
from pyspark.sql import Window
from pyspark.sql.functions import (array_join, col, collect_list, lit,
                                   monotonically_increasing_id, row_number, sum as sum_)

# Windows over a single global partition: one to number the rows, one for the running sum
window = Window.partitionBy(lit(1)).orderBy('Column A')
window_mid = Window.partitionBy(lit(1)).orderBy('mid')

result = (
    df.withColumn("rn", row_number().over(window))           # sequential row number 1..N
      .withColumn("grp", col("rn") % 2)                      # 1 for odd rows, 0 for even rows
      .withColumn("mid", monotonically_increasing_id())
      .withColumn("sum", sum_(col("grp")).over(window_mid))  # running sum: same value per pair
      .groupBy("sum")
      .agg(array_join(collect_list(col("Column A")), ',').alias("column a"),
           array_join(collect_list(col("Column B")), ',').alias("column b"),
           array_join(collect_list(col("Column C")), ',').alias("column c"),
           array_join(collect_list(col("Column D")), ',').alias("column d"))
      .drop("sum")
)
result.show()
#+---------------+---------------+---------------+---------------+
#| column a| column b| column c| column d|
#+---------------+---------------+---------------+---------------+
#|Cell A1,Cell A2|Cell B1,Cell B2|Cell C1,Cell C2|Cell D1,Cell D2|
#|Cell A3,Cell A4|Cell B3,Cell B4|Cell C3,Cell C4|Cell D3,Cell D4|
#+---------------+---------------+---------------+---------------+
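
The example above hard-codes pairs of rows via % 2. For the general "join n rows" case, one possible variant (a sketch, not part of the original answer; the variable n is an assumption) is to derive the group id directly from the row number with integer division:

```python
from pyspark.sql import Window
from pyspark.sql.functions import array_join, col, collect_list, floor, lit, row_number

n = 2  # number of consecutive rows to merge into one output row
window = Window.partitionBy(lit(1)).orderBy('Column A')

(df.withColumn("rn", row_number().over(window))
   .withColumn("grp", floor((col("rn") - 1) / n))  # rows 1..n -> 0, n+1..2n -> 1, ...
   .groupBy("grp")
   .agg(*[array_join(collect_list(col(c)), ',').alias(c.lower()) for c in df.columns])
   .drop("grp")
   .show())
```

As with the answer above, the order of values inside each collect_list is not strictly guaranteed after the shuffle, so it may need an explicit ordering if that matters for your data.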