scala, apache-spark, hive, parquet

Parquet column cannot be converted in file (...) Expected decimal, Found: FIXED_LEN_BYTE_ARRAY


After adding a column of type decimal(16,8) to the schema of a Hive table that already contained other decimal(38,18) columns, we get the following error when reading data from the table with Spark:

Caused by: org.apache.spark.SparkException:
Parquet column cannot be converted in file hdfs://xxx.gz.parquet.
Column: [yyy], Expected: decimal(16,8), Found: FIXED_LEN_BYTE_ARRAY

The Spark code is not even supposed to read/use this column:

case class MyTable(otherColumn: String)

val query = """select * from my_table where ..."""
spark.sql(query).as[MyTable]

It looks like Spark cannot deal with decimal(16,8) Hive columns. How can we work around that?

Note that reading data from the table with other clients, such as Hive JDBC or Trino, works fine.


Solution

  • This appears to be a limitation of Spark when reading such files or tables.

    There are two workarounds:

    • Change the column type to decimal(x,y) with x > 18, if that is acceptable in your situation.
    • Disable the Spark vectorized reader by setting spark.sql.parquet.enableVectorizedReader=false, at the cost of some performance (to be evaluated for your specific case).
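
As a rough sketch of the second option, the setting can be applied either when the session is built or at runtime before the read. The object name and application name below are hypothetical, and the example assumes Spark with Hive support on the classpath and that my_table from the question is reachable through the metastore; treat it as an illustration rather than a tested fix for every Spark version.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to make the sketch self-contained.
object DisableVectorizedReaderSketch {
  def main(args: Array[String]): Unit = {
    // Option A: turn the vectorized Parquet reader off for the whole session.
    val spark = SparkSession.builder()
      .appName("decimal-read-workaround") // hypothetical application name
      .enableHiveSupport()
      .config("spark.sql.parquet.enableVectorizedReader", "false")
      .getOrCreate()

    // Option B: toggle it at runtime, just before the problematic read.
    // Redundant here because option A already set it; shown for completeness.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    // Read the table from the question with the non-vectorized reader.
    val df = spark.table("my_table")
    df.printSchema()

    spark.stop()
  }
}
```

The same property can also be passed on the command line with spark-submit --conf spark.sql.parquet.enableVectorizedReader=false, which avoids a code change when only some jobs read the affected table.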

    From Databricks documentation:

    The vectorized Parquet reader enables native record-level filtering using push-down filters, improving memory locality, and cache utilization. If you disable the vectorized Parquet reader, there may be a minor performance impact. You should only disable it, if you have decimal type columns in your source data.


    Related issues and links for more context: