I have string-type date data like below:
| date (string) | new date (string) |
|---|---|
| 6/23/2024 | 2024-06-23 |
| 1/8/2023 | 2023-08-01 |
| 2024-05-20 | 2024-05-20 |
How can I convert the date(string) column to the new date(string) format using PySpark? Both columns are string types.
Basically, you can create a list of all the formats that could be expected in the date column, try each one, and take the first non-null parse. The order of the list also determines how ambiguous dates are resolved.

e.g. formats = ["d/M/yyyy", "M/d/yyyy", "yyyy-MM-dd"]

"6/23/2024" # 6 must be the month, since 23 can't be one; "d/M/yyyy" yields null, so "M/d/yyyy" wins
"1/8/2023" # ambiguous; "d/M/yyyy" parses first, so 1 is the day and 8 the month, giving 2023-08-01

Because `"d/M/yyyy", "M/d/yyyy"` are tried in that order, the first format wins whenever it parses without producing a null; an ambiguous value like "1/8/2023" never reaches the second format. Pick the order that matches your data's convention.
Script below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, date_format, coalesce
spark = SparkSession.builder.appName("example").getOrCreate()
data = [("6/23/2024",),   # 23 can't be a month, so this must be M/d/yyyy
        ("1/8/2023",),    # ambiguous; resolved as d/M/yyyy by the format order
        ("2024-05-20",)]
columns = ["date"]
df = spark.createDataFrame(data, columns)
def standardize_date(df, input_col):
    # try day-first, then month-first, then ISO; the first successful parse wins
    formats = ["d/M/yyyy", "M/d/yyyy", "yyyy-MM-dd"]
    # alias by index so the column names don't contain "/" characters
    date_cols = [to_date(col(input_col), f).alias(f"parsed_{i}") for i, f in enumerate(formats)]
    df = df.select("*", *date_cols)
    # coalesce picks the first non-null parsed value, honouring the format order
    standardized_col = coalesce(*[col(f"parsed_{i}") for i in range(len(formats))])
    df = df.withColumn("new_date", date_format(standardized_col, "yyyy-MM-dd"))
    return df.select(input_col, "new_date")
result_df = standardize_date(df, "date")
result_df.show()
Output :
+----------+----------+
| date| new_date|
+----------+----------+
| 6/23/2024|2024-06-23|
| 1/8/2023|2023-08-01|
|2024-05-20|2024-05-20|
+----------+----------+