Tags: python, amazon-web-services, apache-spark, amazon-s3, pyspark

Syntax error at or near ':'(line 1, pos 2) - PARSE_SYNTAX_ERROR - == SQL ==


I'm trying to read two JSON files at once from an AWS S3 bucket, but I get a PARSE_SYNTAX_ERROR.

Reading a single file works fine.

I'm running this on AWS Glue.

file_paths = [
    "s3://test_bt/data/batch_date=20241217/20241217/data_0_0_0.json.gz",
    "s3://test_bt/data/batch_date=20241217/20241217/data_0_1_0.json.gz"
]

df_single = spark.read.json(*file_paths)

df_single.show()

ERROR:

Spark Error Class: PARSE_SYNTAX_ERROR; Traceback (most recent call last):
  File "/tmp/TEST.py", line 10, in <module>
     df_single = spark.read.json(*s3_paths)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 254, in json
    self._set_opts(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 50, in _set_opts
    self.schema(schema)  # type: ignore[attr-defined]
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 118, in schema
    self._jreader = self._jreader.schema(schema)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.ParseException: 
Syntax error at or near ':'(line 1, pos 2)

== SQL ==
s3://test_bt/data/batch_date=20241217/20241217/data_0_1_0.json.gz
--^^^

Solution

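  • spark.read.json takes a single path parameter, which may be one string or a list of strings. Unpacking the list with *file_paths passes the second path as a second positional argument, so it is bound to the schema parameter. Spark then tries to parse that S3 URL as a schema string and fails at the ':' in s3://, which is the ParseException shown above. Pass the list itself instead: spark.read.json(file_paths).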
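
To make the argument binding concrete, here is a minimal sketch that runs without Spark. The read_json function is a hypothetical stand-in, not part of PySpark; it only mirrors the first two parameters of DataFrameReader.json (path, then schema) so you can see where each value ends up.

```python
# Hypothetical stand-in that mirrors only the first two parameters of
# DataFrameReader.json(path, schema=None, ...). It does not touch Spark.
def read_json(path, schema=None):
    return {"path": path, "schema": schema}


file_paths = [
    "s3://test_bt/data/batch_date=20241217/20241217/data_0_0_0.json.gz",
    "s3://test_bt/data/batch_date=20241217/20241217/data_0_1_0.json.gz",
]

# Unpacking the list: the second path lands in the schema slot.
unpacked = read_json(*file_paths)
print("schema when unpacked:", unpacked["schema"])

# Passing the list itself: both paths stay in path, schema stays unset.
as_list = read_json(file_paths)
print("path when passed as list:", as_list["path"])
print("schema when passed as list:", as_list["schema"])

# With a real SparkSession named spark, the corresponding call would be:
# df = spark.read.json(file_paths)
```

In the unpacked call, the second S3 URL is what ends up as the schema, which matches the string shown under == SQL == in the traceback.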