Tags: python, pyspark, bigdata, data-manipulation

How to combine multiple succeeding rows in PySpark?


I have a dataset, for example

Column A    Column B    Column C    Column D
Cell A1     Cell B1     Cell C1     Cell D1
Cell A2     Cell B2     Cell C2     Cell D2
Cell A3     Cell B3     Cell C3     Cell D3
Cell A4     Cell B4     Cell C4     Cell D4

Is there a way to join every n consecutive rows together while maintaining the columns? For example, row 1 and row 2 get joined, so that I get:

Column A       Column B       Column C       Column D
Cell A1, A2    Cell B1, B2    Cell C1, C2    Cell D1, D2
Cell A3, A4    Cell B3, B4    Cell C3, C4    Cell D3, D4
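
For reference, a minimal sketch that builds this sample DataFrame (assuming an active SparkSession named spark; the column names are taken from the tables above):

    from pyspark.sql import SparkSession

    # Assumed setup: a local SparkSession, purely for illustration.
    spark = SparkSession.builder.master("local[*]").appName("combine-rows").getOrCreate()

    df = spark.createDataFrame(
        [
            ("Cell A1", "Cell B1", "Cell C1", "Cell D1"),
            ("Cell A2", "Cell B2", "Cell C2", "Cell D2"),
            ("Cell A3", "Cell B3", "Cell C3", "Cell D3"),
            ("Cell A4", "Cell B4", "Cell C4", "Cell D4"),
        ],
        ["Column A", "Column B", "Column C", "Column D"],
    )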

Solution

  • You can try row_number() together with monotonically_increasing_id().

    Create a group flag by taking row_number() % 2, then use a running sum over that flag (a window ordered by monotonically_increasing_id()) to build the group key.

    Example:

    from pyspark.sql import Window
    from pyspark.sql.functions import (
        array_join, col, collect_list, lit,
        monotonically_increasing_id, row_number, sum as sum_,
    )

    # Single-partition windows (fine for small data): one ordered by the
    # original sort column, one ordered by the monotonically increasing id.
    window = Window.partitionBy(lit(1)).orderBy("Column A")
    window_mid = Window.partitionBy(lit(1)).orderBy("mid")

    (df.withColumn("rn", row_number().over(window))           # 1, 2, 3, 4, ...
       .withColumn("grp", col("rn") % 2)                      # 1, 0, 1, 0, ... marks the start of each pair
       .withColumn("mid", monotonically_increasing_id())
       .withColumn("sum", sum_(col("grp")).over(window_mid))  # running sum -> 1, 1, 2, 2, ... = group key
       .groupBy("sum")
       .agg(array_join(collect_list(col("Column A")), ",").alias("column a"),
            array_join(collect_list(col("Column B")), ",").alias("column b"),
            array_join(collect_list(col("Column C")), ",").alias("column c"),
            array_join(collect_list(col("Column D")), ",").alias("column d"))
       .drop("sum")                                           # rn/grp/mid are already gone after the aggregation
       .show())
    
    #+---------------+---------------+---------------+---------------+
    #|       column a|       column b|       column c|       column d|
    #+---------------+---------------+---------------+---------------+
    #|Cell A1,Cell A2|Cell B1,Cell B2|Cell C1,Cell C2|Cell D1,Cell D2|
    #|Cell A3,Cell A4|Cell B3,Cell B4|Cell C3,Cell C4|Cell D3,Cell D4|
    #+---------------+---------------+---------------+---------------+
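
    If you need to merge every n rows instead of pairs, a simpler group key is floor((row_number() - 1) / n). A minimal sketch under the same assumptions (a DataFrame df with the columns above; n is a hypothetical group size, not part of the original answer):

    from pyspark.sql import Window
    from pyspark.sql.functions import array_join, col, collect_list, floor, lit, row_number

    n = 2  # hypothetical: number of consecutive rows to merge

    window = Window.partitionBy(lit(1)).orderBy("Column A")

    (df.withColumn("rn", row_number().over(window))
       .withColumn("grp", floor((col("rn") - 1) / n))   # 0, 0, 1, 1, ... for n = 2
       .groupBy("grp")
       .agg(*[array_join(collect_list(col(c)), ",").alias(c.lower())
              for c in ["Column A", "Column B", "Column C", "Column D"]])
       .drop("grp")
       .show())

    Note that collect_list does not guarantee element order after a shuffle, so on larger data you may want to collect structs of (rn, value) and sort them before joining.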