Search code examples
apache-sparkpysparkgroup-byapache-spark-sql

Add column to Pyspark which assign number of groups to regaridng rows


i have a dataframe:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('').getOrCreate()
df = spark.createDataFrame([("a", "65"), ("b", "23"),("c", "65"), ("d", "23"),
                        ("a", "66"), ("b", "46"),("c", "23"), ("d", "66"),
            ("b", "5"), ("b", "3"),("c", "3")], ["column2", "value"])
df.show()

+-------+-----+
|column2|value|
+-------+-----+ 
|      a| 65  |
|      b| 23  |
|      c| 65  |
|      d| 23  |
|      a| 66  |
|      b| 46  |
|      c| 23  |
|      d| 66  |
|      b|  5  |
|      b|  3  |
|      c|  3  |
+-------+-----+

And I wanted to make each 4 row as an one group. Then regarding to that group create new column where i can assign the number of groups to the corresponding rows. So the desired output is as following:

+-------+-----+------+
|column2|value|gr_val|
+-------+-----+ -----+
|      a| 65  |    1 |
|      b| 23  |    1 |
|      c| 65  |    1 |
|      d| 23  |    1 |
|      a| 66  |    2 |
|      b| 46  |    2 |
|      c| 23  |    2 |
|      d| 66  |    2 |
|      b|  5  |    3 |
|      b|  3  |    3 |
|      c|  3  |    3 |
+-------+-----+------+

I would appreciate any helps!


Solution

  • Try this approach -

    (1) Create a new column (dummy) that will hold sequentially increasing number to each row. lit('a') used to create static value to generate sequentially increasing row number.

    (2) Devide the dummy column with number or records you want in each group (eg. 4) and take ceil. Ceil return the smallest integer not less than the value.

    Here is detailed example -

    from pyspark.sql.functions import *
    from pyspark.sql.window import *
    
    w = Window().partitionBy(lit('a')).orderBy(lit('a'))
    
    df.withColumn("row_num", row_number().over(w))\
        .selectExpr('column2 AS column2','value AS value','ceil(row_num/4) as gr_val')\
        .show()
    
    #+-------+-----+------+
    #|column2|value|gr_val|
    #+-------+-----+------+
    #|      a|   65|     1|
    #|      b|   23|     1|
    #|      c|   65|     1|
    #|      d|   23|     1|
    #|      a|   66|     2|
    #|      b|   46|     2|
    #|      c|   23|     2|
    #|      d|   66|     2|
    #|      b|    5|     3|
    #|      b|    3|     3|
    #|      c|    3|     3|
    #+-------+-----+------+