I'm updating a PySpark script for a new database model, and I've run into problems referencing and updating columns. PySpark apparently brings in all of the database columns in uppercase, and ApplyMapping is not case sensitive, but a left join with another table is. So I end up with multiple columns sharing the same name, one in uppercase and the other in lowercase, and I want to pick between them with the SelectFields function.
I've tried using the exact (case-sensitive) names of the columns, but it always returns the same one.
I've printed the schema, and the only difference between the duplicated columns is the case.
testDF = testDF.join(test2DF, "COLUMN", how='left')
test3DF = test3DF.withColumn("column", test3DF['ATTRB'].substr(4, 4))
fullDF = testDF.join(test3DF, testDF['ID'] == test3DF['ID'])
.....
applymappingTest = ApplyMapping.apply(frame = fullDF, mappings = [
    ('ID', 'string', 'identifier', 'string'),
    ('column', 'string', 'myColumn', 'string'),
    ('COLUMN', 'string', 'someother', 'string')
], transformation_ctx = "applymappingTest")
......
selectfieldsTest = SelectFields.apply(frame = applymappingTest, paths = [
    "identifier",
    "myColumn",
], transformation_ctx = "selectfieldsTest")
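As a quick diagnostic (a small sketch; Counter comes from the standard library, and the printed list is only illustrative), this lists the column names that collide when case is ignored:

from collections import Counter

# Count each column name case-insensitively, then list the names that
# collide with another column once case is ignored.
counts = Counter(name.lower() for name in fullDF.columns)
duplicates = [name for name in fullDF.columns if counts[name.lower()] > 1]
print(duplicates)  # e.g. ['column', 'COLUMN']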
Expected result:
myColumn holds the values from the lowercase column.
Actual result:
myColumn holds the values from the uppercase column.
You can set the caseSensitive option in applyMapping.
def applyMapping(
    mappings: Seq[Product4[String, String, String, String]],
    caseSensitive: Boolean = true,
    transformationContext: String = "",
    callSite: CallSite = CallSite("Not provided", ""),
    stageThreshold: Long = 0,
    totalThreshold: Long = 0
): DynamicFrame
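In a Python Glue job the same switch may be exposed as a case_sensitive keyword on ApplyMapping.apply (this is a sketch; the parameter name and its availability depend on your aws-glue-libs version, so check the signature in your environment):

applymappingTest = ApplyMapping.apply(
    frame = fullDF,
    mappings = [
        ('ID', 'string', 'identifier', 'string'),
        ('column', 'string', 'myColumn', 'string'),
        ('COLUMN', 'string', 'someother', 'string')
    ],
    case_sensitive = True,  # assumed keyword, mirroring the Scala caseSensitive flag
    transformation_ctx = "applymappingTest")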
By the way, if I'm not wrong, applyMapping is case sensitive by default, but Spark SQL is case insensitive by default. To make Spark SQL case sensitive, you can use:
spark_session.sql('set spark.sql.caseSensitive=true')
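Equivalently, through the runtime config API (a minimal sketch, assuming spark_session is your existing SparkSession); set it before the joins so "column" and "COLUMN" resolve as distinct names:

# Same effect as the SET statement above.
spark_session.conf.set('spark.sql.caseSensitive', 'true')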