amazon-web-services, pyspark, apache-spark-sql, aws-glue

pyspark.sql.utils.AnalysisException: Reference 'title' is ambiguous, could be: title, title


I am using Glue version 3.0, Python version 3, and Spark version 3.1. I am extracting data from XML, creating a DataFrame, and writing the data to an S3 path in CSV format. Before writing the DataFrame I printed its schema and one record with show(1), and up to that point everything was fine. But while writing it to a CSV file in the S3 location I got a "duplicate column found" error, because my DataFrame had two columns named "Title" and "title".
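
Roughly, the job looks like the sketch below (the XML reader format, the rowTag value, and the S3 paths are placeholders rather than my real ones, and it assumes the spark-xml reader is available to the Glue job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the XML into a DataFrame (assumes the spark-xml package is on the classpath)
    df = (spark.read.format("xml")
          .option("rowTag", "record")
          .load("s3://my-bucket/input/"))

    df.printSchema()
    df.show(1)

    # This write fails with a duplicate-column error because the schema
    # contains both "Title" and "title"
    df.write.mode("overwrite").csv("s3://my-bucket/output/")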

I tried to add a new column title2 holding the contents of title, planning to drop title afterwards:

    from pyspark.sql import functions as f

    df = df.withColumn('title2', f.expr("title"))

but got the error Reference 'title' is ambiguous, could be: title, title. I also tried df = df.withColumn('title2', f.col("title")) and got the same error. Any help or approach to resolve this, please?


Solution

  • By default, Spark is case-insensitive; we can make it case-sensitive by setting spark.sql.caseSensitive to True.

    from pyspark.sql import functions as f

    # A DataFrame with two columns whose names differ only by case
    df = spark.createDataFrame([("CapitalizedTitleColumn", "title_column")], ("Title", "title"))

    # With case-sensitive resolution, 'title' refers only to the lowercase column
    spark.conf.set('spark.sql.caseSensitive', True)

    df.withColumn('title2', f.expr("title")).show()

    Output

    +--------------------+------------+------------+
    |               Title|       title|      title2|
    +--------------------+------------+------------+
    |CapitalizedTitleC...|title_column|title_column|
    +--------------------+------------+------------+
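
    With case sensitivity enabled, the rest of the original plan works too: copy title into title2, drop the lowercase duplicate, and write the CSV. A rough sketch, assuming a placeholder S3 output path:

    from pyspark.sql import functions as f

    spark.conf.set('spark.sql.caseSensitive', True)

    df = (df.withColumn('title2', f.col('title'))  # copy the lowercase column
            .drop('title'))                        # only the lowercase duplicate is dropped

    # The CSV writer no longer sees duplicate column names
    df.write.mode('overwrite').option('header', True).csv('s3://my-bucket/output/')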