
How to convert string type to timestamp in pyspark?


I am struggling to convert a string column into a timestamp. The data looks like this:

+--------------------+
|              mydate|
+--------------------+
|26/Feb/2023:13:58:40|
|26/Feb/2023:13:30:33|
|26/Feb/2023:13:52:50|
|26/Feb/2023:13:47:09|
|26/Feb/2023:13:30:33|
|26/Feb/2023:13:14:28|
|26/Feb/2023:13:11:42|
|26/Feb/2023:13:34:03|
|26/Feb/2023:13:50:43|
|26/Feb/2023:13:10:47|
|26/Feb/2023:13:28:09|
|26/Feb/2023:13:30:16|
|26/Feb/2023:13:19:07|
|26/Feb/2023:13:30:24|
|26/Feb/2023:13:30:16|
|26/Feb/2023:13:05:37|
|26/Feb/2023:13:09:24|
|26/Feb/2023:13:24:18|
|26/Feb/2023:13:49:13|
|26/Feb/2023:13:56:40|
+--------------------+

The column is a string type, as shown above, and I found some code that is supposed to convert it to a timestamp. My pyspark code is below.

wt.select('mydate').show()
wt.select(to_timestamp(lit('mydate'),"dd/MMM/yyyy:HH:mm:ss")).show()

But every row of the result is null, no matter how many times I try.

+----------------------------------------------+
|to_timestamp('mydate', 'dd/MMM/yyyy:HH:mm:ss')|
+----------------------------------------------+
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
|                                          null|
+----------------------------------------------+

Any help will be appreciated. Thanks.


Solution

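  • The code you have is almost correct. The problem is lit('mydate'): lit creates a literal column holding the text "mydate" rather than referring to the column of that name, so the format never matches and every row comes back null. Use col('mydate') instead.

    Suppose you have a dataframe with timestamps stored as strings.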

    +--------------------+
    |             strDate|
    +--------------------+
    |26/Feb/2023:13:30:16|
    |26/Feb/2023:13:05:37|
    +--------------------+
    

    Convert the 'strDate' column to a timestamp by passing the format that matches the strings.

    from pyspark.sql import functions as F
    
    # Keep the DataFrame in res; show() returns None, so call it separately.
    res = df.select(F.to_timestamp(F.col('strDate'), "dd/MMM/yyyy:HH:mm:ss"))
    res.show()
    

    Yields

    +-------------------------------------------+
    |to_timestamp(strDate, dd/MMM/yyyy:HH:mm:ss)|
    +-------------------------------------------+
    |                        2023-02-26 13:30:16|
    |                        2023-02-26 13:05:37|
    +-------------------------------------------+
    

    We can verify that the new column is a timestamp with res.dtypes:

    res.dtypes
    
    Out[28]: [('to_timestamp(strDate, dd/MMM/yyyy:HH:mm:ss)', 'timestamp')]