scala group-by pyspark apache-spark-sql rdd

How to create dynamic group in PySpark dataframe?

The real problem is to create multiple groups based on the values of two or more columns in consecutive rows, but I am simplifying it here. Suppose I have a PySpark dataframe like this:

>>> from pyspark.sql import Row
>>> df=sqlContext.createDataFrame([
... Row(SN=1,age=45, gender='M', name='Bob'),
... Row(SN=2,age=28, gender='M', name='Albert'),
... Row(SN=3,age=33, gender='F', name='Laura'),
... Row(SN=4,age=43, gender='F', name='Gloria'),
... Row(SN=5,age=18, gender='T', name='Simone'),
... Row(SN=6,age=45, gender='M', name='Alax'),
... Row(SN=7,age=28, gender='M', name='Robert')])
>>> df.show()

+---+---+------+------+
| SN|age|gender|  name|
+---+---+------+------+
|  1| 45|     M|   Bob|
|  2| 28|     M|Albert|
|  3| 33|     F| Laura|
|  4| 43|     F|Gloria|
|  5| 18|     T|Simone|
|  6| 45|     M|  Alax|
|  7| 28|     M|Robert|
+---+---+------+------+

Now I want to add a "section" column that keeps the same value while the gender in consecutive rows matches, and is incremented whenever the gender changes in the next row. To be precise, I want output like this:

+---+---+------+------+-------+
| SN|age|gender|  name|section|
+---+---+------+------+-------+
|  1| 45|     M|   Bob|      1|
|  2| 28|     M|Albert|      1|
|  3| 33|     F| Laura|      2|
|  4| 43|     F|Gloria|      2|
|  5| 18|     T|Simone|      3|
|  6| 45|     M|  Alax|      4|
|  7| 28|     M|Robert|      4|
+---+---+------+------+-------+

Solution

It's unclear whether you're looking for a Python or a Scala solution, but the two would be very similar, so here's a Scala solution using window functions:

import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// we'll use this window to attach the "previous" gender to each record
// (note: with no partitionBy, Spark moves all rows into a single partition)
val globalWindow = Window.orderBy("SN")

// we'll use this window to compute "cumulative sum" of 
// an indicator column that would be 1 only if gender changed
val upToThisRowWindow = globalWindow.rowsBetween(Long.MinValue, 0)

val result = df
  .withColumn("prevGender", lag("gender", 1) over globalWindow) // add previous record's gender
  .withColumn("shouldIncrease", when($"prevGender" =!= $"gender", 1) otherwise 0) // translate to 1 or 0
  .withColumn("section", (sum("shouldIncrease") over upToThisRowWindow) + lit(1)) // cumulative sum
  .drop("prevGender", "shouldIncrease") // drop helper columns

result.show()
// +---+---+------+------+-------+
// | SN|age|gender|  name|section|
// +---+---+------+------+-------+
// |  1| 45|     M|   Bob|      1|
// |  2| 28|     M|Albert|      1|
// |  3| 33|     F| Laura|      2|
// |  4| 43|     F|Gloria|      2|
// |  5| 18|     T|Simone|      3|
// |  6| 45|     M|  Alax|      4|
// |  7| 28|     M|Robert|      4|
// +---+---+------+------+-------+
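
To see why the cumulative sum produces the section numbers, the same three steps (lag, change indicator, running total) can be traced in plain Python without Spark. This is only an illustrative sketch: `assign_sections` and its `keys` parameter are hypothetical names, not part of any Spark API, and the rows are assumed to be already sorted by SN.

```python
from itertools import accumulate


def assign_sections(rows, keys):
    """Return a 1-based section number per row; rows must already be sorted."""
    # The value(s) that define a group, as one tuple per row
    current = [tuple(row[k] for k in keys) for row in rows]
    # Step 1 (lag): the previous row's value; the first row has no predecessor
    previous = [None] + current[:-1]
    # Step 2 (indicator): 1 where the value changed, 0 otherwise (first row is 0)
    changed = [int(p is not None and p != c) for p, c in zip(previous, current)]
    # Step 3 (cumulative sum): running total of changes, shifted to start at 1
    return [total + 1 for total in accumulate(changed)]


# Same SN/gender/age values as the example dataframe in the question
rows = [
    {"SN": 1, "age": 45, "gender": "M"},
    {"SN": 2, "age": 28, "gender": "M"},
    {"SN": 3, "age": 33, "gender": "F"},
    {"SN": 4, "age": 43, "gender": "F"},
    {"SN": 5, "age": 18, "gender": "T"},
    {"SN": 6, "age": 45, "gender": "M"},
    {"SN": 7, "age": 28, "gender": "M"},
]

sections = assign_sections(rows, keys=["gender"])
print(sections)  # [1, 1, 2, 2, 3, 4, 4]
```

Passing several names in `keys` compares the columns together as a tuple, which is the multi-column case mentioned in the question. In Spark, that would presumably correspond to combining one "previous value differs" condition per column with OR before taking the cumulative sum.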

And here is the equivalent PySpark code:

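from pyspark.sql import Window as W
from pyspark.sql import functions as F

# window over all rows ordered by SN, used to look up the previous row's gender
globalWindow = W.orderBy("SN")
# frame from the first row up to the current row, for the cumulative sum
# (W.unboundedPreceding / W.currentRow need Spark 2.1+; older versions can use -sys.maxsize-1 and 0)
upToThisRowWindow = globalWindow.rowsBetween(W.unboundedPreceding, W.currentRow)

# 1 where gender differs from the previous row, otherwise 0
genderChanged = F.when(F.lag("gender", 1).over(globalWindow) != df.gender, 1).otherwise(0)
df.withColumn("section", F.sum(genderChanged).over(upToThisRowWindow) + 1).show()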