I have a dataset, for example
Column A | Column B | Column C | Column D |
---|---|---|---|
Cell A1 | Cell B1 | Cell C1 | Cell D1 |
Cell A2 | Cell B2 | Cell C2 | Cell D2 |
Cell A3 | Cell B3 | Cell C3 | Cell D3 |
Cell A4 | Cell B4 | Cell C4 | Cell D4 |
Is it possible to join n rows together, for example joining row 1 and row 2 while maintaining the columns? So I would get:
Column A | Column B | Column C | Column D |
---|---|---|---|
Cell A1, A2 | Cell B1, B2 | Cell C1, C2 | Cell D1, D2 |
Cell A3, A4 | Cell B3, B4 | Cell C3, C4 | Cell D3, D4 |
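
For reference, assuming the data lives in a PySpark DataFrame (as in the answer below), the sample table could be set up with a minimal sketch like this (the SparkSession name `spark` is an assumption):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table in the question
df = spark.createDataFrame(
    [
        ("Cell A1", "Cell B1", "Cell C1", "Cell D1"),
        ("Cell A2", "Cell B2", "Cell C2", "Cell D2"),
        ("Cell A3", "Cell B3", "Cell C3", "Cell D3"),
        ("Cell A4", "Cell B4", "Cell C4", "Cell D4"),
    ],
    ["Column A", "Column B", "Column C", "Column D"],
)
```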
You can try row_number() and monotonically_increasing_id(): flag the rows by taking row_number() % 2, then use a running sum window over that flag so that each consecutive pair of rows ends up with the same group id.
Example:
from pyspark.sql import Window
from pyspark.sql.functions import (array_join, col, collect_list, lit,
                                   monotonically_increasing_id, row_number, sum as sum_)

# Windows over a single global partition: one to number the rows, one for the running sum
window = Window.partitionBy(lit(1)).orderBy('Column A')
window_mid = Window.partitionBy(lit(1)).orderBy('mid')

result = (
    df.withColumn("rn", row_number().over(window))           # sequential row number 1..N
      .withColumn("grp", col("rn") % 2)                      # 1 for odd rows, 0 for even rows
      .withColumn("mid", monotonically_increasing_id())
      .withColumn("sum", sum_(col("grp")).over(window_mid))  # running sum: same value per pair
      .groupBy("sum")
      .agg(array_join(collect_list(col("Column A")), ',').alias("column a"),
           array_join(collect_list(col("Column B")), ',').alias("column b"),
           array_join(collect_list(col("Column C")), ',').alias("column c"),
           array_join(collect_list(col("Column D")), ',').alias("column d"))
      .drop("sum")
)
result.show()
#+---------------+---------------+---------------+---------------+
#| column a| column b| column c| column d|
#+---------------+---------------+---------------+---------------+
#|Cell A1,Cell A2|Cell B1,Cell B2|Cell C1,Cell C2|Cell D1,Cell D2|
#|Cell A3,Cell A4|Cell B3,Cell B4|Cell C3,Cell C4|Cell D3,Cell D4|
#+---------------+---------------+---------------+---------------+
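
The example above hard-codes pairs of rows via % 2. For the general "join n rows" case, one possible variant (a sketch, not part of the original answer; the variable n is an assumption) is to derive the group id directly from the row number with integer division:

```python
from pyspark.sql import Window
from pyspark.sql.functions import array_join, col, collect_list, floor, lit, row_number

n = 2  # number of consecutive rows to merge into one output row
window = Window.partitionBy(lit(1)).orderBy('Column A')

(df.withColumn("rn", row_number().over(window))
   .withColumn("grp", floor((col("rn") - 1) / n))  # rows 1..n -> 0, n+1..2n -> 1, ...
   .groupBy("grp")
   .agg(*[array_join(collect_list(col(c)), ',').alias(c.lower()) for c in df.columns])
   .drop("grp")
   .show())
```

As with the answer above, the order of values inside each collect_list is not strictly guaranteed after the shuffle, so it may need an explicit ordering if that matters for your data.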