Assume we have a dataframe as follows
df = spark.createDataFrame([ ("2023-10-26",), ("2022-09-01",), ("2021-08-10",), ], ["start_date"]) df.show()
i want to generate a new column with a new date obtained by subtracting or adding a specific days to those "start_date" in my df while excluding the weekends(Assume Saturday and Sundays) and this needs to be done without making use of the UDF. In other words i am trying to replicate "firstworkdate" function part of Qlik using PySpark
DBR version - 12.2 LTS,
Spark version - 3.3.2
Azure Databricks Environment
I know this can be done using UDF but i dont want to make use of UDF as its is not supported in Unity catalouge
Session specific UDF definition also does not work in my case
LOGIC:
Here is a way to find the FirstWorkDay. Lets try with 2023-12-16 (Saturday) as End_Date and 10 as day_num:
Note: We follow the above steps if the day_num > week_number (Mon: 0, Tue: 1, Wed: 2....). Else we can directly subtract day_num from End_Date.
Implementation:
def firstWorkDay(df, date_col, num_date_col):
"""
Purpose: Get the First Working Day, given an end date and number of working days to in between.
Parameters: df -> Dataframe: The dataframe to be passed with date(end date) and num_date(number of working days) columns.
date_col -> str: Name of the date(end date) column
num_date_col -> str: Name of the num_date(number of working days) column
Returns: A dataframe with column "firstWorkDay" added
"""
df = calc_dayofweek_from_monday(df, date_col, "num_day")
# Check if the date is weekday or not. If it is a weekend then our new_date should be the closest Friday
df = df.withColumn(
"new_date",
when(
date_format(date_col, "E").isin(["Sat", "Sun"]),
date_sub(
col(date_col), when(date_format(date_col, "E") == "Sat", 1).otherwise(2)
),
).otherwise(col(date_col)),
)
# Re-calculate the dayofweek for new_date
df = calc_dayofweek_from_monday(df, "new_date", "new_num_day")
df = df.withColumn(
"nearest_monday", date_sub(col("new_date"), col("new_num_day") - 1)
)
df = df.withColumn("day_diff_from_mon", col(num_date_col) - col("new_num_day"))
# We would need to accomodate the day_diff_from_mon in weeks_req
df = df.withColumn("weeks_req", (ceil(col("day_diff_from_mon") / 5)))
# Accounting the Sat, Sun present in a week so that we can directly subtract days_sub_till_mon days from nearest_monday and get the FirstWorkDay
df = df.withColumn(
"days_sub_till_mon", col("day_diff_from_mon") + col("weeks_req") * 2
)
# We follow the above steps if the num_date_col > new_num_day (week_number (Mon: 0, Tue: 1, Wed: 2....)). Else we can directly subtract day_num from new_date
df = df.withColumn(
"firstWorkDay",
when(
col(num_date_col) > col("new_num_day"),
date_sub(
col("nearest_monday"), col("days_sub_till_mon").cast(IntegerType())
),
).otherwise(
date_sub(col("new_date"), col(num_date_col).cast(IntegerType()) - 1)
),
)
return df.select(date_col, num_date_col, "firstWorkDay")
Note: In my case, I needed the week to start on Monday hence I have created a custom function called "calc_dayofweek_from_monday" but default "dayofweek" can also be used.