Search code examples
python-3.xdataframeazurepysparkazure-databricks

How to find the date difference excluding weekends using PySpark without using UDF


Assume we have a dataframe as follows df = spark.createDataFrame([ ("2023-10-26",), ("2022-09-01",), ("2021-08-10",), ], ["start_date"]) df.show()

i want to generate a new column with a new date obtained by subtracting or adding a specific days to those "start_date" in my df while excluding the weekends(Assume Saturday and Sundays) and this needs to be done without making use of the UDF. In other words i am trying to replicate "firstworkdate" function part of Qlik using PySpark

DBR version - 12.2 LTS,
Spark version - 3.3.2
Azure Databricks Environment

I know this can be done using UDF but i dont want to make use of UDF as its is not supported in Unity catalouge
Session specific UDF definition also does not work in my case


Solution

  • LOGIC:

    Here is a way to find the FirstWorkDay. Lets try with 2023-12-16 (Saturday) as End_Date and 10 as day_num:

    1. Check if the End_Date is weekday or not. If it is a weekend then our End_Date should be the closest Friday, i.e 2023-12-15.
    2. Get the date of the Monday that just passed, i.e 2023-12-11.
    3. Get the diff between Recent_Monday and End_Date, i.e 5
    4. Subtract day_num and diff = 10-5 = 5
    5. Find the week difference between End_day and FirstWorkDay: ceil(Subtracted_days/5), i.e ceil(5/5) = 1
    6. Find the number of days between End_day and FirstWorkDay: Subtracted_days + ceil_num2 = 5 + 12 = 7
    7. So we need to subtract 7 days from Recent_Monday to get our answer. 2023-12-11 - 7 days = 2023-12-04.

    Note: We follow the above steps if the day_num > week_number (Mon: 0, Tue: 1, Wed: 2....). Else we can directly subtract day_num from End_Date.

    Implementation:

    def firstWorkDay(df, date_col, num_date_col):
    """
    Purpose: Get the First Working Day, given an end date and number of working days to in between. 
    
    Parameters: df -> Dataframe: The dataframe to be passed with date(end date) and num_date(number of working days) columns.
                date_col -> str: Name of the date(end date) column
                num_date_col -> str: Name of the num_date(number of working days) column
    
    Returns: A dataframe with column "firstWorkDay" added
    """
    
    df = calc_dayofweek_from_monday(df, date_col, "num_day")
    # Check if the date is weekday or not. If it is a weekend then our new_date should be the closest Friday
    df = df.withColumn(
        "new_date",
        when(
            date_format(date_col, "E").isin(["Sat", "Sun"]),
            date_sub(
                col(date_col), when(date_format(date_col, "E") == "Sat", 1).otherwise(2)
            ),
        ).otherwise(col(date_col)),
    )
    # Re-calculate the dayofweek for new_date
    df = calc_dayofweek_from_monday(df, "new_date", "new_num_day")
    df = df.withColumn(
        "nearest_monday", date_sub(col("new_date"), col("new_num_day") - 1)
    )
    df = df.withColumn("day_diff_from_mon", col(num_date_col) - col("new_num_day"))
    # We would need to accomodate the day_diff_from_mon in weeks_req 
    df = df.withColumn("weeks_req", (ceil(col("day_diff_from_mon") / 5)))
    # Accounting the Sat, Sun present in a week so that we can directly subtract days_sub_till_mon days from nearest_monday and get the FirstWorkDay
    df = df.withColumn(
        "days_sub_till_mon", col("day_diff_from_mon") + col("weeks_req") * 2
    )
    # We follow the above steps if the num_date_col > new_num_day (week_number (Mon: 0, Tue: 1, Wed: 2....)). Else we can directly subtract day_num from new_date
    df = df.withColumn(
        "firstWorkDay",
        when(
            col(num_date_col) > col("new_num_day"),
            date_sub(
                col("nearest_monday"), col("days_sub_till_mon").cast(IntegerType())
            ),
        ).otherwise(
            date_sub(col("new_date"), col(num_date_col).cast(IntegerType()) - 1)
        ),
    )
    
    return df.select(date_col, num_date_col, "firstWorkDay")
    

    Note: In my case, I needed the week to start on Monday hence I have created a custom function called "calc_dayofweek_from_monday" but default "dayofweek" can also be used.