amazon-web-services, pyspark, apache-spark-sql, aws-glue

pyspark.sql.utils.AnalysisException: Reference 'title' is ambiguous, could be: title, title


I am using Glue version 3.0, Python version 3, and Spark version 3.1. I am extracting data from XML, creating a DataFrame, and writing the data to an S3 path in CSV format. Before writing the DataFrame I printed its schema and one record with show(1), and up to that point everything was fine. But while writing it to a CSV file in the S3 location I got a "duplicate column found" error, because my DataFrame had two columns named "Title" and "title".
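
Roughly, the job looks like the sketch below (the XML reader format, the rowTag value, and the S3 paths are placeholders rather than my real ones, and it assumes the spark-xml reader is available to the Glue job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the XML into a DataFrame (assumes the spark-xml package is on the classpath)
    df = (spark.read.format("xml")
          .option("rowTag", "record")
          .load("s3://my-bucket/input/"))

    df.printSchema()
    df.show(1)

    # This write fails with a duplicate-column error because the schema
    # contains both "Title" and "title"
    df.write.mode("overwrite").csv("s3://my-bucket/output/")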

I tried to add a new column title2 holding the contents of title, planning to drop title afterwards:

    from pyspark.sql import functions as f

    df = df.withColumn('title2', f.expr("title"))

but got the error Reference 'title' is ambiguous, could be: title, title. I also tried df = df.withColumn('title2', f.col("title")) and got the same error. Any help or approach to resolve this, please?


Solution

  • By default, Spark is case-insensitive; we can make it case-sensitive by setting spark.sql.caseSensitive to True.

    from pyspark.sql import functions as f

    # A DataFrame with two columns whose names differ only by case
    df = spark.createDataFrame([("CapitalizedTitleColumn", "title_column")], ("Title", "title"))

    # With case-sensitive resolution, 'title' refers only to the lowercase column
    spark.conf.set('spark.sql.caseSensitive', True)

    df.withColumn('title2', f.expr("title")).show()

    Output

    +--------------------+------------+------------+
    |               Title|       title|      title2|
    +--------------------+------------+------------+
    |CapitalizedTitleC...|title_column|title_column|
    +--------------------+------------+------------+
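
    With case sensitivity enabled, the rest of the original plan works too: copy title into title2, drop the lowercase duplicate, and write the CSV. A rough sketch, assuming a placeholder S3 output path:

    from pyspark.sql import functions as f

    spark.conf.set('spark.sql.caseSensitive', True)

    df = (df.withColumn('title2', f.col('title'))  # copy the lowercase column
            .drop('title'))                        # only the lowercase duplicate is dropped

    # The CSV writer no longer sees duplicate column names
    df.write.mode('overwrite').option('header', True).csv('s3://my-bucket/output/')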