I'm trying to read a huge unstructured JSON file in Spark. I came across an edge case that seems to be related to columns whose names differ only in letter case (and which also differ in type). Consider the script:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.json("ldap5.json")
And this input:
{"ldap":{"supportedLdapVersion":"3"}}
{"ldap":{"supportedLDAPVersion":["2","3"]}}
Here's the entire stack trace:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/12 09:28:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/03/12 09:28:33 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
/home/d33tah/virtualenv/lib/python3.7/site-packages/pyspark/context.py:317: FutureWarning: Python 3.7 support is deprecated in Spark 3.4.
warnings.warn("Python 3.7 support is deprecated in Spark 3.4.", FutureWarning)
Traceback (most recent call last):
  File "run2.py", line 3, in <module>
    df = spark.read.json("ldap5.json")
  File "/home/d33tah/virtualenv/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 418, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/d33tah/virtualenv/lib/python3.7/site-packages/py4j/java_gateway.py", line 1323, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/d33tah/virtualenv/lib/python3.7/site-packages/pyspark/errors/exceptions/captured.py", line 175, in deco
    raise converted from None
pyspark.errors.exceptions.captured.AnalysisException: [COLUMN_ALREADY_EXISTS] The column `supportedldapversion` already exists. Consider to choose another name or rename the existing column.
And the versions:
> pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/
Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 11.0.19
Branch HEAD
Compiled by user heartsavior on 2024-02-15T11:24:58Z
Revision fd86f85e181fc2dc0f50a096855acf83a6cc5d9c
Url https://github.com/apache/spark
Type --help for more information.
Assuming I have zero control over the JSON file, is there anything I can do to make the input load successfully?
I later found this answer that applies here too:
Try to set spark.sql.caseSensitive to true (false by default):

spark.conf.set('spark.sql.caseSensitive', True)

You can see its definition in the source code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L833

In addition, you can see in the JDBCWriteSuite how it affects the JDBC connector: https://github.com/apache/spark/blob/ee95ec35b4f711fada4b62bc27281252850bb475/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
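Applied to this question, here is a minimal end-to-end sketch (the file name ldap5.json comes from the question; the printed schema is my expectation of the inferred result, not captured output):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Set the flag before reading: with the default false, JSON schema inference
# compares field names case-insensitively, which is what raises
# [COLUMN_ALREADY_EXISTS] for supportedLdapVersion vs. supportedLDAPVersion.
spark.conf.set("spark.sql.caseSensitive", True)

df = spark.read.json("ldap5.json")
df.printSchema()
# Expected, roughly:
# root
#  |-- ldap: struct (nullable = true)
#  |    |-- supportedLDAPVersion: array (nullable = true)
#  |    |    |-- element: string (containsNull = true)
#  |    |-- supportedLdapVersion: string (nullable = true)

With case sensitivity enabled the two spellings stay separate fields, so anything downstream has to reconcile them explicitly (for example by coalescing the scalar field with the array variant).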