I defined an external table on a group of partitioned Parquet files as follows:
CREATE EXTERNAL TABLE foobarbaz (
src_file string,
[...]
temperature string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION '{1}'
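For completeness: since the table is partitioned, the partitions were registered with the metastore before querying (which is why the count below is non-zero); with Hive-style dt=... directories under the table location, something like this works:

spark.sql("MSCK REPAIR TABLE foobarbaz")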
If I then run
df = spark.table("foobarbaz")
print(df.count())
I get the correct non-zero result.
If I run
df = spark.table("foobarbaz")
df.show()
PySpark raises
py4j.protocol.Py4JJavaError: An error occurred while calling o95.showString. [...] Caused by: java.lang.UnsupportedOperationException
Why?
I found an issue specific to my situation that may still be relevant to future readers. I extracted the schema using parquet-tools. One column was listed as INT96, so in the schema definition I naively declared it as int. Closer investigation revealed that INT96 is Parquet's legacy on-disk encoding for timestamps, so the column should have been declared as timestamp. Changing the schema definition accordingly resolved the issue.

This also explains why count() succeeded while show() failed: count() never needs to decode the misdeclared column (Spark prunes the projection down to nothing and relies on Parquet row counts), whereas show() has to materialize every column's values, and the reader cannot convert INT96 data to an int column, hence the UnsupportedOperationException.
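To catch this kind of mismatch before it blows up at read time, you can compare the schema Spark infers directly from the files against what the metastore reports for the table. A minimal sketch (the partition path is a placeholder for one of your own directories):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Schema inferred from the Parquet files themselves; INT96 columns
# surface here as Spark's TimestampType rather than an integer type.
spark.read.parquet("/path/to/foobarbaz/dt=2020-01-01").printSchema()

# Schema the metastore holds for the external table, for comparison.
spark.table("foobarbaz").printSchema()

Any column whose type differs between the two listings is a candidate for the same failure.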