python · apache-spark-sql · aws-glue · pyspark

How do I determine which columns I need, since ApplyMapping isn't case sensitive?


I'm updating a PySpark script for a new database model and I've run into problems referencing columns. PySpark apparently brings in all columns in uppercase. ApplyMapping is not case sensitive, but a left join with another table is, so I end up with multiple columns with the same name, one in uppercase and one in lowercase, and I want to use the SelectFields function.

I've tried using the exact (case-sensitive) names of the columns, but the result is always the same.

I've tried printing the schema, and the only difference between the duplicate columns is the case.

testDF = testDF.join(test2DF, "COLUMN",how='left')

test3DF=test3DF.withColumn("column", test3DF['ATTRB'].substr(4,4))

fullDF= testDF.join(test3DF, (testDF['ID'] == test3DF['ID']) )

.....

applymappingTest = ApplyMapping.apply(frame = fullDF, mappings = [
    ('ID', 'string', 'identifier', 'string'),
    ('column', 'string', 'myColumn', 'string'),
    ('COLUMN', 'string', 'someother', 'string')
    ], transformation_ctx = "applymappingTest")

......

selectfieldsTest= SelectFields.apply(frame = applymappingTest, paths = [
     "identifier",
     "myColumn",
], transformation_ctx = "selectfieldsTest")
Expected result:
myColumn contains the values from the lowercase-named column.
Actual result:
myColumn contains the values from the uppercase-named column.
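The collision can be reproduced without Glue. Here is a minimal plain-Python sketch (an illustration of the observed behavior, not the Glue implementation) of what a case-insensitive mapping does when both `COLUMN` and `column` exist in the joined frame: the first name that matches regardless of case wins, so the uppercase column shadows the lowercase one.

```python
def apply_mapping_case_insensitive(row, mappings):
    """Simulate case-insensitive source-column lookup.
    Hypothetical helper illustrating the behavior, not the Glue API."""
    out = {}
    for src, _src_type, dst, _dst_type in mappings:
        # Case-insensitive match: the first column whose name matches wins.
        for col, value in row.items():
            if col.lower() == src.lower():
                out[dst] = value
                break
    return out

# After the joins, the row carries both case variants of the same name.
row = {"ID": "1", "COLUMN": "from-join", "column": "from-substr"}
mappings = [
    ("ID", "string", "identifier", "string"),
    ("column", "string", "myColumn", "string"),
]
result = apply_mapping_case_insensitive(row, mappings)
# "COLUMN" appears first in the row, so myColumn picks up the
# uppercase column's value, matching the "Actual result" above.
```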

Solution

  • You can set the caseSensitive option in applyMapping.

    def applyMapping( mappings : Seq[Product4[String, String, String, String]], caseSensitive : Boolean = true, transformationContext : String = "", callSite : CallSite = CallSite("Not provided", ""), stageThreshold : Long = 0, totalThreshold : Long = 0 ) : DynamicFrame 
    

    https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping

    By the way, if I'm not wrong, applyMapping is case sensitive by default, but Spark SQL is case insensitive by default. To make Spark SQL case sensitive, you can run: spark_session.sql('set spark.sql.caseSensitive=true')
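To see what the flag changes, here is a plain-Python sketch of case-sensitive versus case-insensitive column resolution (illustrative only, not the Spark or Glue implementation):

```python
def resolve(columns, name, case_sensitive):
    """Return the column names that `name` resolves to under each mode."""
    if case_sensitive:
        return [c for c in columns if c == name]
    return [c for c in columns if c.lower() == name.lower()]

cols = ["ID", "COLUMN", "column"]
# Case sensitive: "column" resolves to exactly one column.
resolve(cols, "column", case_sensitive=True)
# Case insensitive: both case variants match, which is the kind of
# ambiguity that causes the wrong column to be selected.
resolve(cols, "column", case_sensitive=False)
```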