I am trying to add a new column called timestamp
in AWS Glue.
My upstream data keys have capital letters.
However, after adding the column timestamp
, the keys of the remaining columns got lowercased.
select
cast((Absolute_Time - 2082844800) * 1000 as bigint) as timestamp,
*
from
engine_demo;
However, it lowercases all my keys.
def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
df = dfc.select(list(dfc.keys())[0]).toDF()
from pyspark.sql import functions as F
df = df.withColumn(
"timestamp",
((F.col("Absolute_Time") - 2082844800) * 1000).cast("long")
)
dynamic_frame = DynamicFrame.fromDF(df, glueContext, "changed")
return (DynamicFrameCollection({"result_dynf": dynamic_frame}, glueContext))
It also lowercases all my keys
I found Enable case sensitivity for spark.sql globally
However, there is no place to set in "Transform - SQL Query" node, so I try to set the config in "Transform - Custom" code:
def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
glueContext.spark_session.conf.set("spark.sql.caseSensitive", "true")
df = dfc.select(list(dfc.keys())[0]).toDF()
from pyspark.sql import functions as F
df = df.withColumn(
"timestamp",
((F.col("Absolute_Time") - 2082844800) * 1000).cast("long")
)
dynamic_frame = DynamicFrame.fromDF(df, glueContext, "changed")
return (DynamicFrameCollection({"dynamic_frame": dynamic_frame}, glueContext))
However, this above code gives same result.
Is there a way to preserve the key letter case in the AWS Glue Transform node? Thanks!
Here is my experiment result, if I remove this Transform "convert time" step and sink to Delta Lake directly. It will preserve the letter case.
However, as I am using Glue Crawler to register the Delta table in the Glue Catalog.
Here is some info I got from Jay Naves from AWS team (thanks!):
Glue catalog forces lowercase column names to be Hive compatible because the Hive metastore schema lowercases all column names.
So even I preserved the case in Delta table. The downstream will still become lowercase.
I am ending up with this solution:
Since SQL query is designed to be case insensitive by default,
1.
select Absolute_Time from engine_demo;
will return Absolute_Time
.
2.
select absolute_time from engine_demo;
will return absolute_time
3.
select * from engine_demo;
will return absolute_time
.
Most time, in dashboard, users can query using first way to have correct letter case.
If any user needs to use select *
, I can provide a dict to help map from lowercase to correct letter case.