apache-spark, amazon-s3, pyspark, parquet, amazon-emr

Unable to infer schema for Parquet. It must be specified manually


I am running all the code from within EMR Notebooks.

spark.version

'3.0.1-amzn-0'

temp_df.printSchema()

root
 |-- dt: string (nullable = true)
 |-- AverageTemperature: double (nullable = true)
 |-- AverageTemperatureUncertainty: double (nullable = true)
 |-- State: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- weekday: integer (nullable = true)

temp_df.show(2)

+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|        dt|AverageTemperature|AverageTemperatureUncertainty|State|Country|year|month|day|weekday|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|1855-05-01|            25.544|                        1.171| Acre| Brazil|1855|    5|  1|      3|
|1855-06-01|            24.228|                        1.103| Acre| Brazil|1855|    6|  1|      6|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
only showing top 2 rows

temp_df.write.parquet(path='s3://project7878/clean_data/temperatures.parquet', mode='overwrite', partitionBy=['year'])


spark.read.parquet(path='s3://project7878/clean_data/temperatures.parquet').show(2)

An error was encountered:
Unable to infer schema for Parquet. It must be specified manually.;
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 353, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

I have referred to other Stack Overflow posts, but the solution given there (the error being caused by empty Parquet files having been written) does not apply in my case.
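
One quick way to rule that out is to list the object sizes under the output prefix. A minimal sketch using boto3 (the bucket and key prefix are taken from the write call above):

    import boto3

    # List what Spark wrote under the output prefix; non-zero sizes rule out
    # the "empty Parquet files" cause.
    s3 = boto3.client('s3')
    resp = s3.list_objects_v2(Bucket='project7878',
                              Prefix='clean_data/temperatures.parquet/')
    for obj in resp.get('Contents', []):
        print(obj['Key'], obj['Size'])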

Please help me out. Thank you!


Solution

  • Don't pass path as a keyword argument to read.parquet:

    >>> spark.read.parquet(path='a.parquet')
    21/01/02 22:53:38 WARN DataSource: All paths were ignored:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home//bin/spark/python/pyspark/sql/readwriter.py", line 353, in parquet
        return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
      File "/home//bin/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
      File "/home//bin/spark/python/pyspark/sql/utils.py", line 134, in deco
        raise_from(converted)
      File "<string>", line 3, in raise_from
    pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
    >>> spark.read.parquet('a.parquet')
    DataFrame[_2: string, _1: double]
    

    This is because parquet() has no path keyword argument: its signature is parquet(*paths, **options), so path='...' is swallowed as an option, no path is actually passed, and Spark warns that all paths were ignored before failing to infer a schema.

    It is valid if you use load, which does define a path keyword argument (see the sketch after this list):

    >>> spark.read.load(path='a', format='parquet')
    DataFrame[_1: string, _2: string]
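
Applied to the path from the question, either form reads the partitioned output back. A minimal sketch, assuming the same SparkSession (spark) and the S3 location used in the question (the temps name is just for illustration):

    # Pass the location positionally; parquet() collects positional arguments as paths.
    temps = spark.read.parquet('s3://project7878/clean_data/temperatures.parquet')

    # Or keep the keyword style and go through load(), which does define path=.
    temps = spark.read.load(path='s3://project7878/clean_data/temperatures.parquet',
                            format='parquet')

    temps.show(2)

Because the data was written with partitionBy=['year'], Spark's partition discovery restores the year column from the directory names when the data is read back.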