I am trying to better understand the date_format function offered by Spark SQL. As per the official Databricks documentation (I am using Databricks), this function expects any date/string in a valid datetime format. Below is the link for the same.
I am finding it difficult to understand what the exact definition of "valid" is here. I am trying to understand the functionality through two examples. First, an input string in YYYY-MM-DD format (2021-07-09), for which I correctly get the expected result:
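For illustration, this is the shape of the query I am running (the exact output pattern I use is incidental):

select date_format('2021-07-09', 'dd/MM/yyyy')
-- returns: 09/07/2021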
Second, an input string in DD-MM-YYYY format (20-07-2021), for which I get null:
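The same query shape with the second input:

select date_format('20-07-2021', 'dd/MM/yyyy')
-- returns: NULL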
Why is this happening? How does this function know that the parameter I am passing is indeed in YYYY-MM-DD format? It could just as well have been YYYY-DD-MM.
My requirement is to implement logic that can handle all kinds of valid date formats (MM-DD-YYYY, YYYY-MM-DD, DD-MM-YYYY) and format the dates accordingly.
The following are the valid input and output formats for ANSI date/time data types:
Data type              Format                         Example
ANSIDATE               yyyy-mm-dd                     2007-02-28
TIME WITH TIME ZONE    hh:mm:ss.ffff... [+|-]th:tm
The valid range of the time zone offset is from -14:00 to +14:00. DATE complies with the ANSI SQL standard definition for the Gregorian calendar: "NOTE 85 - Datetime data types will allow dates in the Gregorian format to be stored in the date range 0001-01-01 CE through 9999-12-31 CE."
See Databricks SQL datetime patterns for details on valid formats. The function checks that the resulting date is a valid date in the Proleptic Gregorian calendar; otherwise it returns NULL.
When you pass "20-07-2021", it does not conform to the default "yyyy-MM-dd" pattern, so the implicit string-to-date cast fails and the result is NULL.
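To handle several possible input formats, as the question asks, one common approach is to try each pattern in turn and keep the first parse that succeeds. A sketch (the column date_str and table my_table are hypothetical names): under the default non-ANSI settings, to_date returns NULL when the pattern does not match, which is what makes coalesce work here. Note that an ambiguous value such as 07-09-2021 will simply match whichever pattern is listed first.

select coalesce(
         to_date(date_str, 'yyyy-MM-dd'),
         to_date(date_str, 'dd-MM-yyyy'),
         to_date(date_str, 'MM-dd-yyyy')
       ) as parsed_date
from my_table

If ANSI mode is enabled, a failed parse raises an error instead of returning NULL; on newer runtimes try_to_date can be used in place of to_date to keep the NULL-on-failure behaviour.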
Alternatively, you can use the make_date function, which creates a date from year, month, and day fields. Or, better, use the to_date function:
select date_format(to_date('9/15/2021', 'MM/dd/yyyy'), 'yyyy/MM/dd')
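For example, applied to the 20-07-2021 value from the question (results shown as comments):

select make_date(2021, 7, 20)
-- returns: 2021-07-20

select date_format(to_date('20-07-2021', 'dd-MM-yyyy'), 'yyyy/MM/dd')
-- returns: 2021/07/20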