
What causes table corruption error when reading hive bucket table in trino?


First of all, the actual data format cannot be disclosed, but the bucketed table is created in Hive with the following DDL and options.

1. Create DDL

CREATE EXTERNAL TABLE `schema1.ex1` (
  `col1` string,
  `col2` string,
  `col3` string,
  `col4` string
)
PARTITIONED BY (`date` string)
CLUSTERED BY (col1) SORTED BY (col2) INTO 32 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'...

2. Set Property

set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.execution.engine=mr;
set hive.mapreduce.job.queuename=root.queue1;
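
For reference, the bucketing metadata the table actually ended up with can be double-checked in Hive before loading any data (the exact DESCRIBE FORMATTED output layout varies by Hive version):

DESCRIBE FORMATTED schema1.ex1;
-- among other things, the output should report:
--   Num Buckets:        32
--   Bucket Columns:     [col1]
--   Sort Columns:       [Order(col:col2, order:1)]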

3. Insert Data (1 month)

INSERT INTO TABLE `schema1.ex1` PARTITION (`date`)
SELECT col1, col2, col3, col4, `date`
FROM `schema1.existing_table`
WHERE `date` BETWEEN '2021-09-01' AND '2021-10-01';
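
For reference, one way to spot-check that each partition was written with the expected 32 bucket files is Hive's INPUT__FILE__NAME virtual column; a minimal sketch for a single partition:

-- list the distinct files backing one partition; a correctly bucketed
-- partition written as above should typically show 32 files (000000_0 ... 000031_0)
SELECT DISTINCT INPUT__FILE__NAME
FROM schema1.ex1
WHERE `date` = '2021-09-02';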

4. Read the Data (in Trino)

Then, when I read the Hive bucketed table in Trino, I got the following error:

io.trino.spi.TrinoException: Hive table is corrupt.
File 'hdfs://cluster1/hive/warehouse/schema1/ex1/date=2021-09-02/000026_0' is for bucket 26, but contains a row for bucket 9.

I don't know why the above error occurred. I would appreciate it if you could tell me the cause of the error and the solution.

I use Hive 3.1.2 and Trino 356.


Solution

  • Trino implements bucket validation. For this error to be thrown, the table is most likely truly corrupt.

    Bucket validation verifies, as data is read, that each row actually belongs to the bucket file it was found in, and thereby prevents incorrect query results.

    To test, the following catalog session property can be set in Trino before running the query:

    SET SESSION hive.validate_bucketing = false;

    The query will then run, but the results should be examined for incorrectly bucketed data; see the sketch below.
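
    A minimal sketch of that check in the Trino CLI, assuming the Hive catalog is named hive and using the connector's hidden "$path" column to see which bucket file each row was read from:

    -- catalog session property: turn off bucket validation for this session only
    SET SESSION hive.validate_bucketing = false;

    -- read the affected partition and keep the source file of every row,
    -- so mis-bucketed rows can be traced back to a concrete bucket file
    SELECT "$path", col1, col2
    FROM hive.schema1.ex1
    WHERE "date" = '2021-09-02';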