Tags: apache-spark, pyspark, group-by, window-functions, spark-window-function

PySpark group by with rolling window


Suppose I have a table with three columns: date, id and value.

df_tmp = spark.createDataFrame([('2023-01-01', 1001, 5),
                                ('2023-01-15', 1001, 3),
                                ('2023-02-10', 1001, 1),
                                ('2023-02-20', 1001, 2),
                                ('2023-01-02', 1002, 7),
                                ('2023-01-02', 1002, 6),
                                ('2023-01-03', 1002, 1)],
                               ["date", "id", "value"])
df_tmp.show()
# +----------+----+-----+
# |      date|  id|value|
# +----------+----+-----+
# |2023-01-01|1001|    5|
# |2023-01-15|1001|    3|
# |2023-02-10|1001|    1|
# |2023-02-20|1001|    2|
# |2023-01-02|1002|    7|
# |2023-01-02|1002|    6|
# |2023-01-03|1002|    1|
# +----------+----+-----+

I would like to compute, for every date, the 30-day rolling sum of value grouped by id, and additionally the number of distinct dates on which the id was seen in the past 30 days. Something that would look like this:

+----------+----+-----+----------------+-------------------------+
|      date|  id|value|30_day_value_sum|days_seen_in_past_30_days|
+----------+----+-----+----------------+-------------------------+
|2023-01-01|1001|    5|               0|                        0|
|2023-01-15|1001|    3|               0|                        1|
|2023-02-10|1001|    1|               3|                        1|
|2023-02-20|1001|    2|               1|                        2|
|2023-01-02|1002|    7|               0|                        0|
|2023-01-02|1002|    6|               7|                        1|
|2023-01-03|1002|    1|              13|                        2|
+----------+----+-----+----------------+-------------------------+

I suspect one could do it using Window, but I am not clear on the exact details.


Solution

  • I assume that your dataframe originally has the date data type, so I used a slightly modified input (note that the first 1002 row here is 2023-01-01, not 2023-01-02):

    from pyspark.sql import functions as F, Window as W
    df_tmp = spark.createDataFrame(
        [('2023-01-01', 1001, 5),
         ('2023-01-15', 1001, 3),
         ('2023-02-10', 1001, 1),
         ('2023-02-20', 1001, 2),
         ('2023-01-01', 1002, 7),
         ('2023-01-02', 1002, 6),
         ('2023-01-03', 1002, 1)],
        ["date", "id", "value"]
    ).withColumn('date', F.col('date').cast('date'))
    

    In this case, the following window should work (`unix_date` converts the date into a number of days since the epoch, so `rangeBetween(-30, -1)` covers the 30 days preceding, and excluding, the current row's date):

    w = W.partitionBy('id').orderBy(F.expr("unix_date(date)")).rangeBetween(-30, -1)
    df = df_tmp.withColumn('30_day_value_sum', F.sum('value').over(w)) \
               .withColumn('days_seen_in_past_30_days', F.count('id').over(w))
    df = df.fillna(0, subset=['30_day_value_sum'])
    
    df.show()
    # +----------+----+-----+----------------+-------------------------+
    # |      date|  id|value|30_day_value_sum|days_seen_in_past_30_days|
    # +----------+----+-----+----------------+-------------------------+
    # |2023-01-01|1001|    5|               0|                        0|
    # |2023-01-15|1001|    3|               5|                        1|
    # |2023-02-10|1001|    1|               3|                        1|
    # |2023-02-20|1001|    2|               1|                        1|
    # |2023-01-01|1002|    7|               0|                        0|
    # |2023-01-02|1002|    6|               7|                        1|
    # |2023-01-03|1002|    1|              13|                        2|
    # +----------+----+-----+----------------+-------------------------+
    
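    Note that F.count('id') counts rows in the window, not distinct dates. In the sample above they coincide because no id has two rows on the same date, but if duplicates per date are possible and you really need distinct dates (as the question asks), one possible variant is to collect the set of dates over the same window. This is a sketch, not part of the original answer:

    # count distinct dates (rather than rows) seen in the preceding 30 days
    days_distinct = F.size(F.collect_set('date').over(w))
    df = df_tmp.withColumn('30_day_value_sum', F.sum('value').over(w)) \
               .withColumn('days_seen_in_past_30_days', days_distinct) \
               .fillna(0, subset=['30_day_value_sum'])
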

    If you have a string type in the "date" column, then you should use the following .orderBy clause:

    .orderBy(F.expr("unix_date(to_date(date))")).rangeBetween(-30, -1)
    
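    For example, assuming df_tmp was created without the .cast('date') step, the full pipeline could look like this (a sketch under that assumption — to_date parses the 'yyyy-MM-dd' strings before unix_date converts them to day numbers):

    # window over the previous 30 days, keyed by days-since-epoch of the parsed date
    w = W.partitionBy('id') \
         .orderBy(F.expr("unix_date(to_date(date))")) \
         .rangeBetween(-30, -1)
    df = df_tmp.withColumn('30_day_value_sum', F.sum('value').over(w)) \
               .withColumn('days_seen_in_past_30_days', F.count('id').over(w)) \
               .fillna(0, subset=['30_day_value_sum'])
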
