Hello my devoted coders,
First of all, I want you to know that I have already tried many of the solutions on the first page of your favorite search engine. They all concern this error:
TypeError: field dt: TimestampType can not accept object '2021-05-01T09:19:46' in type <class 'str'>
My data are stored in an Amazon S3 bucket as raw.csv and look like:
2021-05-01T09:19:46,...
2021-05-01T09:19:42,...
2021-05-01T09:19:39,...
I have tried:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, unix_timestamp
from pyspark.sql.types import *
from awsglue.context import GlueContext

session = SparkSession.builder.getOrCreate()
df = GlueContext(SparkContext.getOrCreate()).create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={'paths': ["s3://bucket/to/raw.csv"]},
    format="csv",
    format_options={'withHeader': True}
).toDF()
events_schema = StructType([
    StructField("dt", TimestampType(), nullable=False),
    # and many other columns
])
df = session.createDataFrame(df.rdd, schema=events_schema)
df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")) \
  .show(1, False)
And:
df.withColumn("dt", unix_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")
              .cast("double")
              .cast("timestamp")) \
  .show(1, False)
I still have the exact same error.
Try reading dt as a StringType first, then use df.withColumn with to_timestamp to cast it to TimestampType. With your current schema, createDataFrame fails before withColumn ever runs, because it cannot coerce the raw string into a TimestampType field. Example:
events_schema = StructType([
    StructField("dt", StringType(), nullable=False),
    # and many other columns
])
df = session.createDataFrame(df.rdd, schema=events_schema)
df.show(10, False)
#+-------------------+
#|dt |
#+-------------------+
#|2021-05-01T09:19:46|
#+-------------------+
df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")).show()
#+-------------------+
#| dt|
#+-------------------+
#|2021-05-01 09:19:46|
#+-------------------+