I have string-type date data like below:
| date (string) | new date (string) |
|---|---|
| 6/23/2024 | 2024-06-23 |
| 1/8/2023 | 2023-08-01 |
| 2024-05-20 | 2024-05-20 |
How can I convert the date(string) column to the new date(string) format using PySpark? Both columns are string types.
Basically, you can create a list of all the formats that could be expected in the date column, try each one, and take the first non-null parse. The order of the list also determines how ambiguous dates are resolved.

e.g. formats = ["d/M/yyyy", "M/d/yyyy", "yyyy-MM-dd"]

"6/23/2024" # 6 must be the month, since 23 can't be one; "d/M/yyyy" yields null, so "M/d/yyyy" wins
"1/8/2023" # ambiguous; "d/M/yyyy" parses first, so 1 is the day and 8 the month, giving 2023-08-01

Because `"d/M/yyyy", "M/d/yyyy"` are tried in that order, the first format wins whenever it parses without producing a null; an ambiguous value like "1/8/2023" never reaches the second format. Pick the order that matches your data's convention.
Script below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, date_format, coalesce
spark = SparkSession.builder.appName("example").getOrCreate()
data = [("6/23/2024",),   # 23 can't be a month, so this must be M/d/yyyy
        ("1/8/2023",),    # ambiguous; resolved as d/M/yyyy by the format order
        ("2024-05-20",)]
columns = ["date"]
df = spark.createDataFrame(data, columns)
def standardize_date(df, input_col):
    # try day-first, then month-first, then ISO; the first successful parse wins
    formats = ["d/M/yyyy", "M/d/yyyy", "yyyy-MM-dd"]
    # alias by index so the column names don't contain "/" characters
    date_cols = [to_date(col(input_col), f).alias(f"parsed_{i}") for i, f in enumerate(formats)]
    df = df.select("*", *date_cols)
    # coalesce picks the first non-null parsed value, honouring the format order
    standardized_col = coalesce(*[col(f"parsed_{i}") for i in range(len(formats))])
    df = df.withColumn("new_date", date_format(standardized_col, "yyyy-MM-dd"))
    return df.select(input_col, "new_date")
result_df = standardize_date(df, "date")
result_df.show()
Output :
+----------+----------+
| date| new_date|
+----------+----------+
| 6/23/2024|2024-06-23|
| 1/8/2023|2023-08-01|
|2024-05-20|2024-05-20|
+----------+----------+