Tags: dataframe, azure, pyspark, azure-databricks

PySpark on Databricks: how to format a yyyyMMdd column to show as MM/dd/yyyy


In Azure Databricks (PySpark), I already have an existing DataFrame:

df1

Name date
A 20210720
B 20231005
C 20190215

I would like the date column to be formatted like this:

Name date
A 07/20/2021
B 10/05/2023
C 02/15/2019

How should I write the script?

Thank you


Solution

  • You can do something like the following. I am assuming here that the date column is of string type.

    from pyspark.sql.functions import date_format, to_date
    
    
    data = [("20231030",), ("20231115",), ("20231225",)]
    columns = ["date_string"]
    
    df = spark.createDataFrame(data, columns)
    df = df.withColumn("to_date_format", to_date(df["date_string"], "yyyyMMdd"))
    # MM/dd/yyyy (month first) is the layout requested in the question
    df = df.withColumn("formatted_date", date_format(df["to_date_format"], "MM/dd/yyyy"))
    
    df.show()
    df.printSchema()
    

    output:

    >>> df.show()
    +-----------+--------------+--------------+
    |date_string|to_date_format|formatted_date|
    +-----------+--------------+--------------+
    |   20231030|    2023-10-30|    10/30/2023|
    |   20231115|    2023-11-15|    11/15/2023|
    |   20231225|    2023-12-25|    12/25/2023|
    +-----------+--------------+--------------+
    +-----------+--------------+--------------+
    
    >>> df.printSchema()
    root
     |-- date_string: string (nullable = true)
     |-- to_date_format: date (nullable = true)
     |-- formatted_date: string (nullable = true)