Tags: json, apache-spark, pyspark

spark.read.json throws COLUMN_ALREADY_EXISTS when column names differ only by case and type


I'm trying to read a huge unstructured JSON file in Spark. I came across an edge case that seems to be related to column names that differ only in upper/lowercase, and in type. Consider the script:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.json("ldap5.json")

And input:

{"ldap":{"supportedLdapVersion":"3"}}
{"ldap":{"supportedLDAPVersion":["2","3"]}}

Here's the entire stack trace:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/12 09:28:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/03/12 09:28:33 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
/home/d33tah/virtualenv/lib/python3.7/site-packages/pyspark/context.py:317: FutureWarning: Python 3.7 support is deprecated in Spark 3.4.
  warnings.warn("Python 3.7 support is deprecated in Spark 3.4.", FutureWarning)
Traceback (most recent call last):
  File "run2.py", line 3, in <module>
    df = spark.read.json("ldap5.json")
  File "/home/d33tah/virtualenv/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 418, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/d33tah/virtualenv/lib/python3.7/site-packages/py4j/java_gateway.py", line 1323, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/d33tah/virtualenv/lib/python3.7/site-packages/pyspark/errors/exceptions/captured.py", line 175, in deco
    raise converted from None
pyspark.errors.exceptions.captured.AnalysisException: [COLUMN_ALREADY_EXISTS] The column `supportedldapversion` already exists. Consider to choose another name or rename the existing column.

And the versions:

> pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/
                        
Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 11.0.19
Branch HEAD
Compiled by user heartsavior on 2024-02-15T11:24:58Z
Revision fd86f85e181fc2dc0f50a096855acf83a6cc5d9c
Url https://github.com/apache/spark
Type --help for more information.

Assuming I have zero control over the JSON file, is there something I can do to make the input load successfully?


Solution

  • I later found this answer that applies here too:

    Try setting spark.sql.caseSensitive to true (it is false by default):

        spark.conf.set('spark.sql.caseSensitive', True)

    With the default case-insensitive comparison, supportedLdapVersion and supportedLDAPVersion normalize to the same column name during schema inference, which is what triggers COLUMN_ALREADY_EXISTS. You can see the option's definition in the source code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L833

    In addition, you can see in the JDBCWriteSuite how it affects the JDBC connector: https://github.com/apache/spark/blob/ee95ec35b4f711fada4b62bc27281252850bb475/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
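
    For completeness, here is a minimal sketch of that fix applied to the script from the question. The file name ldap5.json and the two-line input come from the question; the schema shown in the trailing comment is what Spark 3.5.1 should infer for that input, not captured output:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # spark.sql.caseSensitive is a runtime SQL option, so it can be set on
        # an existing session; it just has to take effect before the JSON
        # schema is inferred by the read below.
        spark.conf.set("spark.sql.caseSensitive", True)

        df = spark.read.json("ldap5.json")
        df.printSchema()
        # Expected: both fields survive as distinct columns, e.g.
        # root
        #  |-- ldap: struct (nullable = true)
        #  |    |-- supportedLDAPVersion: array (nullable = true)
        #  |    |    |-- element: string (containsNull = true)
        #  |    |-- supportedLdapVersion: string (nullable = true)

    Keep in mind that the setting is session-wide: once enabled, column references in every other query in the same session become case-sensitive as well.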