Tags: sql, apache-spark, pyspark, apache-spark-sql, window-functions

Using rangeBetween considering months rather than days in PySpark


I'm looking for a way to translate this chunk of SQL code into PySpark syntax.

SELECT MEAN(some_value) OVER (
    ORDER BY yyyy_mm_dd
    RANGE BETWEEN INTERVAL 3 MONTHS PRECEDING AND CURRENT ROW
    ) AS mean
FROM
    df

If the above were a range expressed in days, this could easily be done using something like

.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)

(See also ZygD's solution here: Spark Window Functions - rangeBetween dates)

For a range in months, however, this doesn't work, since the number of days in a month is not constant. Any idea how to express a month-based range using PySpark syntax?
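
For reference, here is roughly what the days-based version looks like end to end (toy data, 7-day window; the reference date passed to datediff is arbitrary and only serves to turn the date into an integer):

from pyspark.sql import functions as F
from pyspark.sql import Window as W

df = spark.createDataFrame(
    [('2022-05-01', 1),
     ('2022-05-05', 2),
     ('2022-05-10', 3)],
    ['yyyy_mm_dd', 'some_value']
).withColumn('yyyy_mm_dd', F.to_date('yyyy_mm_dd'))

# datediff against a fixed reference date yields an integer,
# so rangeBetween(-7, 0) means "the last 7 days up to and including the current row"
w = W.orderBy(F.expr("datediff(yyyy_mm_dd, '1970-01-01')")).rangeBetween(-7, 0)
df = df.withColumn('mean', F.mean('some_value').over(w))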


Solution

  • You can "borrow" the full SQL column expression and use it in PySpark.

    Input:

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [('2022-05-01', 1),
         ('2022-06-01', 2),
         ('2022-07-01', 3),
         ('2022-08-01', 4),
         ('2022-09-01', 5)],      
        ['yyyy_mm_dd', 'some_value']
    ).withColumn('yyyy_mm_dd', F.to_date('yyyy_mm_dd'))
    

    Script:

    df = df.withColumn('mean', F.expr("mean(some_value) over (order by yyyy_mm_dd range between interval 3 months preceding and current row)"))
    
    df.show()
    # +----------+----------+----+
    # |yyyy_mm_dd|some_value|mean|
    # +----------+----------+----+
    # |2022-05-01|         1| 1.0|
    # |2022-06-01|         2| 1.5|
    # |2022-07-01|         3| 2.0|
    # |2022-08-01|         4| 2.5|
    # |2022-09-01|         5| 3.5|
    # +----------+----------+----+
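
  • Alternatively, if you prefer to stay in the PySpark Window API, one option (a sketch, working at calendar-month granularity) is to order by the number of whole months elapsed since some epoch and use rangeBetween. Note that this buckets rows by calendar month, so it only matches INTERVAL 3 MONTHS PRECEDING when every date falls on the first of the month (as in the sample data above); the F.expr approach reproduces the SQL semantics exactly.

    from pyspark.sql import Window as W

    # Whole months elapsed since 1970-01: (year - 1970) * 12 + (month - 1)
    months = F.expr("(year(yyyy_mm_dd) - 1970) * 12 + month(yyyy_mm_dd) - 1")

    # Include the current month and the 3 preceding calendar months
    w = W.orderBy(months).rangeBetween(-3, 0)
    df = df.withColumn('mean', F.mean('some_value').over(w))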