Question: Why is myTimeStampCol1 in the following code returning a null value in the third row, and how can we fix the issue?
from pyspark.sql.functions import *
df = spark.createDataFrame(
    data=[("1", "Arpit", "2021-07-24 12:01:19.000"),
          ("2", "Anand", "2019-07-22 13:02:20.000"),
          ("3", "Mike", "11-16-2021 18:00:08")],
    schema=["id", "Name", "myTimeStampCol"])
df.select(col("myTimeStampCol"),to_timestamp(col("myTimeStampCol"),"yyyy-MM-dd HH:mm:ss.SSSS").alias("myTimeStampCol1")).show()
Output
+--------------------+-------------------+
|      myTimeStampCol|    myTimeStampCol1|
+--------------------+-------------------+
|2021-07-24 12:01:...|2021-07-24 12:01:19|
|2019-07-22 13:02:...|2019-07-22 13:02:20|
| 11-16-2021 18:00:08|               null|
+--------------------+-------------------+
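The null in the third row can be reproduced outside Spark: a parser expecting yyyy-MM-dd simply cannot read 11-16-2021. As a minimal sketch, using Python's standard-library strptime as an analogue of Spark's pattern matching (the %-codes below are strptime equivalents of the JDK-style patterns, used here for illustration only; Spark itself returns null where strptime raises):

```python
from datetime import datetime

def parse_or_none(value, fmt):
    """Parse a timestamp string, returning None on a format mismatch,
    mimicking to_timestamp()'s null-on-failure behaviour."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None

# "%Y-%m-%d %H:%M:%S" roughly corresponds to "yyyy-MM-dd HH:mm:ss"
parse_or_none("2021-07-24 12:01:19", "%Y-%m-%d %H:%M:%S")  # parses
parse_or_none("11-16-2021 18:00:08", "%Y-%m-%d %H:%M:%S")  # None: "16" is not a valid month
parse_or_none("11-16-2021 18:00:08", "%m-%d-%Y %H:%M:%S")  # parses with the matching pattern
```

The third value only parses once the pattern is swapped to month-first, which is exactly the fix applied below.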
Remarks: This runs on Azure Databricks (which behaves almost the same as Databricks). The third value is in the format MM-dd-yyyy HH:mm:ss (for example 11-16-2021 18:00:08), and in the asker's actual data all the values in that column have that exact same format, MM-dd-yyyy HH:mm:ss.
The null occurs because of the mismatch between the value's format and the pattern passed to to_timestamp(). Since all the records in this column are in the format MM-dd-yyyy HH:mm:ss, you can modify the code as follows:
df.select(col("myTimeStampCol"),to_timestamp(col("myTimeStampCol"),'MM-dd-yyyy HH:mm:ss').alias("myTimeStampCol1")).show(truncate=False)
#only if all the records in this column are 'MM-dd-yyyy HH:mm:ss' format
to_timestamp() expects either one or two arguments: the column holding the timestamp strings, and optionally the format of those values (with one argument, the default pattern yyyy-MM-dd HH:mm:ss is used). Since all these values share the same format, MM-dd-yyyy HH:mm:ss, you can pass that pattern as the second argument.
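If the column ever mixes both formats (as in the sample data above), a common Spark idiom is coalesce(to_timestamp(col, fmt1), to_timestamp(col, fmt2)), which keeps the first pattern that parses. The same try-each-format idea can be sketched in plain Python; the helper name parse_any and the format list are illustrative assumptions, not part of the original answer:

```python
from datetime import datetime

# Candidate formats, tried in order (illustrative; extend as needed).
FORMATS = ["%Y-%m-%d %H:%M:%S.%f",  # e.g. 2021-07-24 12:01:19.000
           "%m-%d-%Y %H:%M:%S"]     # e.g. 11-16-2021 18:00:08

def parse_any(value):
    """Return the first successful parse, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None

rows = ["2021-07-24 12:01:19.000",
        "2019-07-22 13:02:20.000",
        "11-16-2021 18:00:08"]
parsed = [parse_any(v) for v in rows]  # all three sample rows now parse
```

The Spark equivalent leaves null only for rows that match none of the listed patterns, which is usually the safest behaviour for ingesting inconsistent source data.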