
How to create dynamic group in PySpark dataframe?


The real problem involves creating multiple groups based on the values of two or more columns in consecutive rows, but I am simplifying it here. Suppose I have a PySpark DataFrame like this:

>>> from pyspark.sql import Row
>>> df=sqlContext.createDataFrame([
... Row(SN=1,age=45, gender='M', name='Bob'),
... Row(SN=2,age=28, gender='M', name='Albert'),
... Row(SN=3,age=33, gender='F', name='Laura'),
... Row(SN=4,age=43, gender='F', name='Gloria'),
... Row(SN=5,age=18, gender='T', name='Simone'),
... Row(SN=6,age=45, gender='M', name='Alax'),
... Row(SN=7,age=28, gender='M', name='Robert')])
>>> df.show()

+---+---+------+------+
| SN|age|gender|  name|
+---+---+------+------+
|  1| 45|     M|   Bob|
|  2| 28|     M|Albert|
|  3| 33|     F| Laura|
|  4| 43|     F|Gloria|
|  5| 18|     T|Simone|
|  6| 45|     M|  Alax|
|  7| 28|     M|Robert|
+---+---+------+------+

Now I want to add a "section" column that keeps the same value while the gender in consecutive rows matches, and is incremented whenever the gender changes in the next row. To be precise, I want output like this:

+---+---+------+------+-------+
| SN|age|gender|  name|section|
+---+---+------+------+-------+
|  1| 45|     M|   Bob|      1|
|  2| 28|     M|Albert|      1|
|  3| 33|     F| Laura|      2|
|  4| 43|     F|Gloria|      2|
|  5| 18|     T|Simone|      3|
|  6| 45|     M|  Alax|      4|
|  7| 28|     M|Robert|      4|
+---+---+------+------+-------+

Solution

  • It's unclear whether you're looking for a Python or a Scala solution, but they would be pretty similar, so here's a Scala solution using window functions:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import spark.implicits._
    
    // we'll use this window to attach the "previous" gender to each record
    val globalWindow = Window.orderBy("SN")
    
    // we'll use this window to compute the "cumulative sum" of
    // an indicator column that is 1 only if the gender changed
    val upToThisRowWindow = globalWindow.rowsBetween(Window.unboundedPreceding, Window.currentRow)
    
    val result = df
      .withColumn("prevGender", lag("gender", 1) over globalWindow) // add previous record's gender
      .withColumn("shouldIncrease", when($"prevGender" =!= $"gender", 1) otherwise 0) // translate to 1 or 0
      .withColumn("section", (sum("shouldIncrease") over upToThisRowWindow) + lit(1)) // cumulative sum
      .drop("prevGender", "shouldIncrease") // drop helper columns
    
    result.show()
    // +---+---+------+------+-------+
    // | SN|age|gender|  name|section|
    // +---+---+------+------+-------+
    // |  1| 45|     M|   Bob|      1|
    // |  2| 28|     M|Albert|      1|
    // |  3| 33|     F| Laura|      2|
    // |  4| 43|     F|Gloria|      2|
    // |  5| 18|     T|Simone|      3|
    // |  6| 45|     M|  Alax|      4|
    // |  7| 28|     M|Robert|      4|
    // +---+---+------+------+-------+
    

    The following is the equivalent PySpark code:

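    from pyspark.sql import Window as W
    from pyspark.sql import functions as F

    globalWindow = W.orderBy("SN")  # used to look up the previous row's gender
    upToThisRowWindow = globalWindow.rowsBetween(W.unboundedPreceding, W.currentRow)  # running-sum frame

    (df
     .withColumn("prevGender", F.lag("gender", 1).over(globalWindow))  # previous row's gender
     .withColumn("shouldIncrease", F.when(F.col("prevGender") != F.col("gender"), 1).otherwise(0))
     .withColumn("section", F.sum("shouldIncrease").over(upToThisRowWindow) + 1)  # cumulative sum
     .drop("prevGender", "shouldIncrease")
     .show())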
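
To check the logic without a Spark session, the idea behind the window-function solution above (look at the previous row, flag a change, take a running sum) can be sketched in plain Python. This is only an illustration on the sample gender column, not a replacement for the Spark code; the helper name `assign_sections` is made up for this example.

```python
from itertools import accumulate


def assign_sections(genders):
    """Return a 1-based section number per row (hypothetical helper).

    The number goes up by one whenever a value differs from the previous row.
    """
    # "lag": the previous value for each row; None for the first row,
    # mirroring the null that a lag window function yields there
    previous = [None] + list(genders[:-1])
    # indicator: 1 when a previous value exists and differs, else 0
    changed = [1 if prev is not None and prev != cur else 0
               for prev, cur in zip(previous, genders)]
    # running sum of the indicator, shifted so the first section is 1
    return [total + 1 for total in accumulate(changed)]


genders = ["M", "M", "F", "F", "T", "M", "M"]  # gender column, ordered by SN
sections = assign_sections(genders)
print(sections)  # [1, 1, 2, 2, 3, 4, 4]
```

The resulting list matches the "section" column in the expected output for the seven sample rows.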