Question: Why is myTimeStampCol1 in the following code returning a null value in the third row, and how can we fix the issue?
from pyspark.sql.functions import *
df = spark.createDataFrame(
    data=[("1", "Arpit", "2021-07-24 12:01:19.000"),
          ("2", "Anand", "2019-07-22 13:02:20.000"),
          ("3", "Mike", "11-16-2021 18:00:08")],
    schema=["id", "Name", "myTimeStampCol"])
df.select(col("myTimeStampCol"),to_timestamp(col("myTimeStampCol"),"yyyy-MM-dd HH:mm:ss.SSSS").alias("myTimeStampCol1")).show()
Output
+--------------------+-------------------+
|      myTimeStampCol|    myTimeStampCol1|
+--------------------+-------------------+
|2021-07-24 12:01:...|2021-07-24 12:01:19|
|2019-07-22 13:02:...|2019-07-22 13:02:20|
| 11-16-2021 18:00:08|               null|
+--------------------+-------------------+
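The null in the third row can be reproduced outside Spark: a parser expecting yyyy-MM-dd simply cannot read 11-16-2021. As a minimal sketch, using Python's standard-library strptime as an analogue of Spark's pattern matching (the %-codes below are strptime equivalents of the JDK-style patterns, used here for illustration only; Spark itself returns null where strptime raises):

```python
from datetime import datetime

def parse_or_none(value, fmt):
    """Parse a timestamp string, returning None on a format mismatch,
    mimicking to_timestamp()'s null-on-failure behaviour."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None

# "%Y-%m-%d %H:%M:%S" roughly corresponds to "yyyy-MM-dd HH:mm:ss"
parse_or_none("2021-07-24 12:01:19", "%Y-%m-%d %H:%M:%S")  # parses
parse_or_none("11-16-2021 18:00:08", "%Y-%m-%d %H:%M:%S")  # None: "16" is not a valid month
parse_or_none("11-16-2021 18:00:08", "%m-%d-%Y %H:%M:%S")  # parses with the matching pattern
```

The third value only parses once the pattern is swapped to month-first, which is exactly the fix applied below.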
Remarks: This runs on Azure Databricks (which behaves almost the same as Databricks). The third value is in the format MM-dd-yyyy HH:mm:ss (for example 11-16-2021 18:00:08), and in the asker's actual data all the values in that column have that exact same format, MM-dd-yyyy HH:mm:ss.
The null occurs because of the mismatch between the value's format and the pattern passed to to_timestamp(). Since all the records in this column are in the format MM-dd-yyyy HH:mm:ss, you can modify the code as follows:
df.select(col("myTimeStampCol"),to_timestamp(col("myTimeStampCol"),'MM-dd-yyyy HH:mm:ss').alias("myTimeStampCol1")).show(truncate=False)
#only if all the records in this column are 'MM-dd-yyyy HH:mm:ss' format
to_timestamp() expects either one or two arguments: the column holding the timestamp strings, and optionally the format of those values (with one argument, the default pattern yyyy-MM-dd HH:mm:ss is used). Since all these values share the same format, MM-dd-yyyy HH:mm:ss, you can pass that pattern as the second argument.
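If the column ever mixes both formats (as in the sample data above), a common Spark idiom is coalesce(to_timestamp(col, fmt1), to_timestamp(col, fmt2)), which keeps the first pattern that parses. The same try-each-format idea can be sketched in plain Python; the helper name parse_any and the format list are illustrative assumptions, not part of the original answer:

```python
from datetime import datetime

# Candidate formats, tried in order (illustrative; extend as needed).
FORMATS = ["%Y-%m-%d %H:%M:%S.%f",  # e.g. 2021-07-24 12:01:19.000
           "%m-%d-%Y %H:%M:%S"]     # e.g. 11-16-2021 18:00:08

def parse_any(value):
    """Return the first successful parse, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None

rows = ["2021-07-24 12:01:19.000",
        "2019-07-22 13:02:20.000",
        "11-16-2021 18:00:08"]
parsed = [parse_any(v) for v in rows]  # all three sample rows now parse
```

The Spark equivalent leaves null only for rows that match none of the listed patterns, which is usually the safest behaviour for ingesting inconsistent source data.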