Hello my devoted coders,
First of all, I want you to know that I have already tried many of the solutions on the first page of your favorite search engine. They all concern this error:
TypeError: field dt: TimestampType can not accept object '2021-05-01T09:19:46' in type <class 'str'>
My data are stored in an Amazon S3 bucket as raw.csv and look like:
2021-05-01T09:19:46,...
2021-05-01T09:19:42,...
2021-05-01T09:19:39,...
I have tried:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, unix_timestamp
from pyspark.sql.types import *
from awsglue.context import GlueContext

session = SparkSession.builder.getOrCreate()
df = GlueContext(SparkContext.getOrCreate()).create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={'paths': ["s3://bucket/to/raw.csv"]},
    format="csv",
    format_options={'withHeader': True}
).toDF()
events_schema = StructType([
    StructField("dt", TimestampType(), nullable=False),
    # and many other columns
])
df = session.createDataFrame(df.rdd, schema=events_schema)
df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")) \
  .show(1, False)
And:
df.withColumn("dt", unix_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")
              .cast("double")
              .cast("timestamp")) \
  .show(1, False)
I still have the exact same error.
Try reading dt as a StringType first, then use df.withColumn with to_timestamp to cast it to TimestampType. With your current schema, createDataFrame fails before withColumn ever runs, because it cannot coerce the raw string into a TimestampType field. Example:
events_schema = StructType([
    StructField("dt", StringType(), nullable=False),
    # and many other columns
])
df = session.createDataFrame(df.rdd, schema=events_schema)
df.show(10, False)
#+-------------------+
#|dt |
#+-------------------+
#|2021-05-01T09:19:46|
#+-------------------+
df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss")).show()
#+-------------------+
#| dt|
#+-------------------+
#|2021-05-01 09:19:46|
#+-------------------+