Tags: amazon-web-services, pyspark, apache-spark-sql, aws-glue

How to preserve the key letter case in AWS Glue Transform node?


I am trying to add a new column called timestamp in AWS Glue.

My upstream data keys have capital letters.

However, after adding the column timestamp, the keys of the remaining columns got lowercased.

Experiment 1: Transform - SQL Query

select 
    cast((Absolute_Time - 2082844800) * 1000 as bigint) as timestamp,
    *
from
    engine_demo;

However, it lowercases all my keys.

Experiment 2: Transform - Custom code

from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection
from pyspark.sql import functions as F

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the first (and only) DynamicFrame in the collection as a Spark DataFrame
    df = dfc.select(list(dfc.keys())[0]).toDF()

    # 2082844800 s is the offset between the 1904 epoch and the Unix epoch,
    # so this yields a Unix timestamp in milliseconds
    df = df.withColumn(
        "timestamp",
        ((F.col("Absolute_Time") - 2082844800) * 1000).cast("long")
    )
    dynamic_frame = DynamicFrame.fromDF(df, glueContext, "changed")
    return DynamicFrameCollection({"result_dynf": dynamic_frame}, glueContext)

It also lowercases all my keys.

Experiment 3

I found the question "Enable case sensitivity for spark.sql globally".

However, there is no place to set this config in the "Transform - SQL Query" node, so I tried setting it in the "Transform - Custom code" node:

from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection
from pyspark.sql import functions as F

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Ask Spark SQL to treat column names case-sensitively
    glueContext.spark_session.conf.set("spark.sql.caseSensitive", "true")

    df = dfc.select(list(dfc.keys())[0]).toDF()
    df = df.withColumn(
        "timestamp",
        ((F.col("Absolute_Time") - 2082844800) * 1000).cast("long")
    )
    dynamic_frame = DynamicFrame.fromDF(df, glueContext, "changed")
    return DynamicFrameCollection({"dynamic_frame": dynamic_frame}, glueContext)

However, the code above gives the same result.

Is there a way to preserve the key letter case in the AWS Glue Transform node? Thanks!


Solution

  • Here is my experiment result: if I remove the "convert time" Transform step and sink to Delta Lake directly, the letter case is preserved.

    However, I am using a Glue Crawler to register the Delta table in the Glue Catalog, and the catalog lowercases the names again.

    Here is some info I got from Jay Naves from the AWS team (thanks!):

    Glue catalog forces lowercase column names to be Hive compatible because the Hive metastore schema lowercases all column names.

    https://github.com/apache/hive/blob/9d81764788920851c4461827a9984dbf48fdf6db/ql/src/java/org/apache/hadoop/hive/ql/metadata/SessionHiveMetaStoreClient.java#L909

    So even though I preserved the case in the Delta table, the column names will still become lowercase downstream.
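
    For reference, the lowercasing can be confirmed by reading the crawler-registered schema back from the Glue Catalog. This is a minimal sketch; "my_database" and "engine_demo" are placeholders for the actual catalog names.

    # Minimal sketch: inspect the column names stored in the Glue Catalog.
    # "my_database" and "engine_demo" are placeholder names.
    import boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName="my_database", Name="engine_demo")
    print([c["Name"] for c in table["Table"]["StorageDescriptor"]["Columns"]])
    # Expected output: all-lowercase names, e.g. ['absolute_time', ...]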

    I ended up with the following solution. Since SQL queries are designed to be case insensitive by default:

    1.

    select Absolute_Time from engine_demo;
    

    will return Absolute_Time.

    2.

    select absolute_time from engine_demo;
    

    will return absolute_time.

    3.

    select * from engine_demo;
    

    will return absolute_time.

    Most of the time, dashboard users can use the first form in their queries to get the correct letter case.

    If any user needs to use select *, I can provide a dict that maps the lowercase names back to the correct letter case (see the sketch below).
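
    Here is a minimal sketch of such a mapping dict. The list of original column names is an assumption; apart from Absolute_Time and timestamp, the names are made up for illustration.

    # Minimal sketch: map lowercased names back to the original letter case.
    # `original_columns` is assumed to hold the upstream column names as they
    # looked before the Glue Catalog lowercased them; "Engine_Speed" is made up.
    original_columns = ["Absolute_Time", "Engine_Speed", "timestamp"]

    case_map = {name.lower(): name for name in original_columns}

    def restore_case(columns):
        # Replace each lowercased name with its original-case form when known
        return [case_map.get(c, c) for c in columns]

    print(restore_case(["absolute_time", "engine_speed", "timestamp"]))
    # ['Absolute_Time', 'Engine_Speed', 'timestamp']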