dataframe azure pyspark azure-databricks

PySpark on Databricks: how to format a yyyymmdd column to show as mm/dd/yyyy

In Azure Databricks (PySpark), I already have an existing DataFrame:

df1

Name	date
A	20210720
B	20231005
C	20190215

I would like the date column to be formatted as shown here:

Name	date
A	07/20/2021
B	10/05/2023
C	02/15/2019

How should I write the script?

Thank you

Solution

You can do something like the following. Here I am assuming the date column is of string type.

from pyspark.sql.functions import date_format, to_date


data = [("20231030",), ("20231115",), ("20231225",)]
columns = ["date_string"]

df = spark.createDataFrame(data, columns)
df = df.withColumn("to_date_format", to_date(df["date_string"], "yyyyMMdd"))
# MM = month, dd = day; the question asks for mm/dd/yyyy, so the month goes first
df = df.withColumn("formatted_date", date_format(df["to_date_format"], "MM/dd/yyyy"))

df.show()
df.printSchema()

Output:

>>> df.show()
+-----------+--------------+--------------+
|date_string|to_date_format|formatted_date|
+-----------+--------------+--------------+
|   20231030|    2023-10-30|    10/30/2023|
|   20231115|    2023-11-15|    11/15/2023|
|   20231225|    2023-12-25|    12/25/2023|
+-----------+--------------+--------------+

>>> df.printSchema()
root
 |-- date_string: string (nullable = true)
 |-- to_date_format: date (nullable = true)
 |-- formatted_date: string (nullable = true)