Tags: apache-spark, hive, pyspark, hql, hiveql

External table on parquet files. .count() works, .show() fails


I defined an external table on a group of partitioned parquet files as follows:

CREATE EXTERNAL TABLE foobarbaz (
      src_file string,
      [...]
      temperature string
      )
      PARTITIONED BY (dt string)
      STORED AS PARQUET
      LOCATION '{1}'

If I then run

df = spark.table("foobarbaz")
print(df.count())

I get the correct non-zero result.

If I run

df = spark.table("foobarbaz")
df.show()

PySpark raises

py4j.protocol.Py4JJavaError: An error occurred while calling o95.showString. [...] Caused by: java.lang.UnsupportedOperationException

Why?



Solution

  • I found an issue specific to my situation that may still be relevant to future readers. I extracted the schema using parquet-tools. One column was listed as INT96, so in the schema definition I naively used the int type for this column. Closer investigation revealed that the column actually held datetime values (Parquet stores legacy timestamps as INT96). Changing the column's type in the schema definition accordingly resolved the issue. This presumably also explains the asymmetry: .count() can be answered from Parquet metadata without decoding the mismatched column, whereas .show() has to materialize its values, which is where the UnsupportedOperationException surfaced.
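For readers hitting the same mismatch, below is a minimal PySpark sketch of the workflow: inspect the physical Parquet schema as Spark sees it, then redefine the table with a matching Hive type. The path, the column name event_time, and the MSCK REPAIR step are illustrative assumptions, not taken from the original question; substitute your own location and column list.

```python
from pyspark.sql import SparkSession

# Hive support is required so the DDL below is registered in the metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1) Let Spark infer the schema directly from the files. An INT96 physical
#    column is reported here as `timestamp`, which reveals that declaring
#    it as `int` in the table definition was the mistake.
spark.read.parquet("/path/to/parquet/root").printSchema()

# 2) Recreate the external table with the corrected column type.
#    `event_time` is a placeholder for the misdeclared column.
spark.sql("""
    CREATE EXTERNAL TABLE foobarbaz (
        src_file string,
        event_time timestamp,
        temperature string
    )
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION '/path/to/parquet/root'
""")

# 3) Register the existing partition directories (assumed layout dt=...),
#    after which both actions behave consistently.
spark.sql("MSCK REPAIR TABLE foobarbaz")
df = spark.table("foobarbaz")
print(df.count())
df.show()
```

If the table already exists, it needs to be dropped and recreated (or altered with ALTER TABLE ... CHANGE COLUMN) before the corrected type takes effect; since the table is external, this does not touch the underlying parquet files.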