I'm trying to read two JSON files at a time from an AWS S3 bucket, but I'm getting the error PARSE_SYNTAX_ERROR. When I read a single file, it works fine. I'm using AWS Glue for execution.
s3_paths = [
    "s3://test_bt/data/batch_date=20241217/20241217/data_0_0_0.json.gz",
    "s3://test_bt/data/batch_date=20241217/20241217/data_0_1_0.json.gz",
]
df_single = spark.read.json(*s3_paths)
df_single.show()
ERROR:
Spark Error Class: PARSE_SYNTAX_ERROR; Traceback (most recent call last):
File "/tmp/TEST.py", line 10, in <module>
df_single = spark.read.json(*s3_paths)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 254, in json
self._set_opts(
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 50, in _set_opts
self.schema(schema) # type: ignore[attr-defined]
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 118, in schema
self._jreader = self._jreader.schema(schema)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 196, in deco
raise converted from None
pyspark.sql.utils.ParseException:
Syntax error at or near ':'(line 1, pos 2)
== SQL ==
s3://test_bt/data/batch_date=20241217/20241217/data_0_1_0.json.gz
--^^^
The cause: `spark.read.json` takes the path(s) as its first parameter, which may be a single string or a list of strings. By unpacking the list with `*`, you pass the second path as the second positional parameter, `schema`. Spark then tries to parse that S3 URI as a DDL schema string, which fails with PARSE_SYNTAX_ERROR at the `:` in `s3://` (hence "pos 2" in the traceback). Pass the list itself instead of unpacking it.