I am running all the code from within EMR Notebooks.
spark.version
'3.0.1-amzn-0'
temp_df.printSchema()
root
|-- dt: string (nullable = true)
|-- AverageTemperature: double (nullable = true)
|-- AverageTemperatureUncertainty: double (nullable = true)
|-- State: string (nullable = true)
|-- Country: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- weekday: integer (nullable = true)
temp_df.show(2)
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
| dt|AverageTemperature|AverageTemperatureUncertainty|State|Country|year|month|day|weekday|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|1855-05-01| 25.544| 1.171| Acre| Brazil|1855| 5| 1| 3|
|1855-06-01| 24.228| 1.103| Acre| Brazil|1855| 6| 1| 6|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
only showing top 2 rows
temp_df.write.parquet(path='s3://project7878/clean_data/temperatures.parquet', mode='overwrite', partitionBy=['year'])
spark.read.parquet(path='s3://project7878/clean_data/temperatures.parquet').show(2)
An error was encountered:
Unable to infer schema for Parquet. It must be specified manually.;
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 353, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
I have referred to other Stack Overflow posts, but the solution given there (the problem being caused by empty files having been written) does not apply in my case.
Please help me out. Thank you!
Don't use the path keyword argument in the read.parquet call:
>>> spark.read.parquet(path='a.parquet')
21/01/02 22:53:38 WARN DataSource: All paths were ignored:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home//bin/spark/python/pyspark/sql/readwriter.py", line 353, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "/home//bin/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/home//bin/spark/python/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
>>> spark.read.parquet('a.parquet')
DataFrame[_2: string, _1: double]
This is because parquet does not accept a path keyword argument: in Spark 3.x its signature is parquet(*paths, **options), so path='a.parquet' is collected into the options dict and the positional list of paths stays empty (hence the "All paths were ignored" warning). The keyword is valid if you use load:
>>> spark.read.load(path='a', format='parquet')
DataFrame[_1: string, _2: string]
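The signature mismatch can be illustrated with a plain-Python sketch. The function below is a hypothetical stand-in, assuming DataFrameReader.parquet is declared roughly as parquet(*paths, **options) (as in PySpark 3.0); it is not Spark code, just a demonstration of how a keyword call leaves the positional path list empty:

```python
# Stand-in mimicking the assumed shape of DataFrameReader.parquet:
# positional *paths collect the files to read; keywords become reader options.
def parquet(*paths, **options):
    return paths, options

# Keyword call: nothing lands in paths, so there are no files to scan --
# this mirrors the "All paths were ignored" warning and the schema error.
paths, options = parquet(path='a.parquet')
assert paths == ()
assert options == {'path': 'a.parquet'}

# Positional call: the path is actually passed through.
paths, options = parquet('a.parquet')
assert paths == ('a.parquet',)
```

Applied to the question, that means spark.read.parquet('s3://project7878/clean_data/temperatures.parquet') (positional) should work, while the path= form silently reads nothing.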