Tags: apache-spark, pyspark, apache-spark-sql, window-functions

Conditional forward fill values in Spark dataframe


I have a Spark dataframe where I need to create a window-partitioned column ("desired_outcome"). This conditional column should equal the "flag" column (0) until the first 1 appears, and then carry that 1 forward through the rest of the partition ("user_id"). I've tried many different window variations (rowsBetween) but to no avail. Any help or guidance is greatly appreciated.

Here are some sample rows and their "desired_outcome":

columns = ['user_id', 'date', 'flag', 'desired_outcome']
data = [('1', '2022-01-01', 0, 0),
        ('1', '2022-01-02', 0, 0),
        ('1', '2022-01-03', 0, 0),
        ('1', '2022-01-04', 0, 0),
        ('1', '2022-01-05', 1, 1),
        ('1', '2022-01-06', 0, 1),
        ('1', '2022-01-07', 0, 1),
        ('2', '2022-01-01', 0, 0),
        ('2', '2022-01-02', 0, 0),
        ('2', '2022-01-03', 0, 0),
        ('2', '2022-01-04', 0, 0),
        ('2', '2022-01-05', 0, 0),
        ('2', '2022-01-06', 1, 1),
        ('2', '2022-01-07', 0, 1),
        ('2', '2022-01-08', 0, 1),
        ('2', '2022-01-09', 0, 1),
        ('2', '2022-01-10', 0, 1),
        ('2', '2022-01-11', 0, 1),
        ('2', '2022-01-12', 0, 1)]

sample_df = spark.createDataFrame(data, columns)
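
Note that spark is only predefined in the PySpark shell and in notebooks; in a standalone script you would create the session first. A minimal sketch (the app name here is an arbitrary choice):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('forward-fill-demo').getOrCreate()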

Solution

  • We can solve this particular case (as there are only two flag values, 0 and 1) by using a running max() over a window that grows from the start of each partition to the current row.

    import sys
    from pyspark.sql import functions as func
    from pyspark.sql.window import Window as wd

    # running max of flag: stays 0 until the first 1, then remains 1
    sample_df. \
        withColumn('so_answer',
                   func.max('flag').over(wd.partitionBy('user_id').orderBy('date').rowsBetween(-sys.maxsize, 0))
                   ). \
        show()
    
    # +-------+----------+----+---------------+---------+
    # |user_id|      date|flag|desired_outcome|so_answer|
    # +-------+----------+----+---------------+---------+
    # |      1|2022-01-01|   0|              0|        0|
    # |      1|2022-01-02|   0|              0|        0|
    # |      1|2022-01-03|   0|              0|        0|
    # |      1|2022-01-04|   0|              0|        0|
    # |      1|2022-01-05|   1|              1|        1|
    # |      1|2022-01-06|   0|              1|        1|
    # |      1|2022-01-07|   0|              1|        1|
    # |      2|2022-01-01|   0|              0|        0|
    # |      2|2022-01-02|   0|              0|        0|
    # |      2|2022-01-03|   0|              0|        0|
    # |      2|2022-01-04|   0|              0|        0|
    # |      2|2022-01-05|   0|              0|        0|
    # |      2|2022-01-06|   1|              1|        1|
    # |      2|2022-01-07|   0|              1|        1|
    # |      2|2022-01-08|   0|              1|        1|
    # |      2|2022-01-09|   0|              1|        1|
    # |      2|2022-01-10|   0|              1|        1|
    # |      2|2022-01-11|   0|              1|        1|
    # |      2|2022-01-12|   0|              1|        1|
    # +-------+----------+----+---------------+---------+
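
  • For the general case, where "flag" may take values other than 0 and 1 and a running max() no longer gives the right answer, one possible sketch (not from the original answer; "ffill_answer" is just an illustrative name) is to null out the rows before the first 1 and carry the last non-null value forward with last(..., ignorenulls=True), falling back to the original "flag" for the leading rows:

    w = wd.partitionBy('user_id').orderBy('date').rowsBetween(-sys.maxsize, 0)

    # keep only the 1s (everything else becomes null), carry the last
    # non-null value forward, then fall back to the original flag
    sample_df. \
        withColumn('ffill_answer',
                   func.coalesce(
                       func.last(func.when(func.col('flag') == 1, func.col('flag')), ignorenulls=True).over(w),
                       func.col('flag')
                   )
                   ). \
        show()

    On the sample data this reproduces "desired_outcome", and it keeps working if the condition inside when() is swapped for any other trigger value.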