apache-spark, amazon-s3, pyspark, parquet, amazon-emr

Unable to infer schema for Parquet. It must be specified manually


I am running all the code from within EMR Notebooks.

spark.version

'3.0.1-amzn-0'

temp_df.printSchema()

root
 |-- dt: string (nullable = true)
 |-- AverageTemperature: double (nullable = true)
 |-- AverageTemperatureUncertainty: double (nullable = true)
 |-- State: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- weekday: integer (nullable = true)

temp_df.show(2)

+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|        dt|AverageTemperature|AverageTemperatureUncertainty|State|Country|year|month|day|weekday|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|1855-05-01|            25.544|                        1.171| Acre| Brazil|1855|    5|  1|      3|
|1855-06-01|            24.228|                        1.103| Acre| Brazil|1855|    6|  1|      6|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
only showing top 2 rows

temp_df.write.parquet(path='s3://project7878/clean_data/temperatures.parquet', mode='overwrite', partitionBy=['year'])


spark.read.parquet(path='s3://project7878/clean_data/temperatures.parquet').show(2)

An error was encountered:
Unable to infer schema for Parquet. It must be specified manually.;
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 353, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

I have referred to other Stack Overflow posts, but the solution given there (the error being caused by empty Parquet files having been written) does not apply in my case.
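
One quick way to rule that out is to list the object sizes under the output prefix. A minimal sketch using boto3 (the bucket and key prefix are taken from the write call above):

    import boto3

    # List what Spark wrote under the output prefix; non-zero sizes rule out
    # the "empty Parquet files" cause.
    s3 = boto3.client('s3')
    resp = s3.list_objects_v2(Bucket='project7878',
                              Prefix='clean_data/temperatures.parquet/')
    for obj in resp.get('Contents', []):
        print(obj['Key'], obj['Size'])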

Please help me out. Thank you!


Solution

  • Don't pass path as a keyword argument to read.parquet:

    >>> spark.read.parquet(path='a.parquet')
    21/01/02 22:53:38 WARN DataSource: All paths were ignored:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home//bin/spark/python/pyspark/sql/readwriter.py", line 353, in parquet
        return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
      File "/home//bin/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
      File "/home//bin/spark/python/pyspark/sql/utils.py", line 134, in deco
        raise_from(converted)
      File "<string>", line 3, in raise_from
    pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
    >>> spark.read.parquet('a.parquet')
    DataFrame[_2: string, _1: double]
    

    This is because parquet() has no path keyword argument: its signature is parquet(*paths, **options), so path='...' is swallowed as an option, no path is actually passed, and Spark warns that all paths were ignored before failing to infer a schema.

    It is valid if you use load, which does define a path keyword argument (see the sketch after this list):

    >>> spark.read.load(path='a', format='parquet')
    DataFrame[_1: string, _2: string]
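
Applied to the path from the question, either form reads the partitioned output back. A minimal sketch, assuming the same SparkSession (spark) and the S3 location used in the question (the temps name is just for illustration):

    # Pass the location positionally; parquet() collects positional arguments as paths.
    temps = spark.read.parquet('s3://project7878/clean_data/temperatures.parquet')

    # Or keep the keyword style and go through load(), which does define path=.
    temps = spark.read.load(path='s3://project7878/clean_data/temperatures.parquet',
                            format='parquet')

    temps.show(2)

Because the data was written with partitionBy=['year'], Spark's partition discovery restores the year column from the directory names when the data is read back.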