Tags: apache-spark, hive, amazon-emr, apache-hudi

Getting duplicate records while querying Hudi table using Hive on Spark Engine in EMR 6.3.1


I am querying a Hudi table using Hive with Spark as the execution engine, on an EMR 6.3.1 cluster.

The Hudi version is 0.7.

I inserted a few records and then updated them using a Hudi Merge on Read (MOR) table. Internally, this creates new files under the same partition containing the updated data/records.
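For reference, a minimal sketch of such an upsert from spark-shell, assuming the Hudi bundle is on the classpath; the table name, S3 path, and field names (`uuid`, `ts`, `partition`) are made up for illustration:

```scala
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Hypothetical updated record; the record key (uuid) matches a previously inserted row.
val updates = Seq(("id-1", "2021-06-01 10:00:00", "2021/06/01", "updated-value"))
  .toDF("uuid", "ts", "partition", "payload")

updates.write.format("hudi")
  .option("hoodie.table.name", "hudi_demo")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "uuid")         // key used to dedupe
  .option("hoodie.datasource.write.precombine.field", "ts")          // latest ts wins on merge
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/hudi_demo")
```

The upsert writes new file slices for the affected file groups; the older parquet files remain on disk until cleaning, which is why a reader that does not resolve file slices will see duplicates.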

Now, when I query the same table using Spark SQL, it works fine and returns no duplicates: it honours only the latest records/parquet files. It also works fine when I use Tez as the underlying engine for Hive.
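The Spark SQL snapshot read that behaves correctly looks roughly like this (path and view name are illustrative; on Hudi 0.7 the load path typically needs a glob down to the partition level):

```scala
// Snapshot query: Hudi resolves each file group to its latest file slice,
// so only the most recent version of each record is returned.
val snapshot = spark.read.format("hudi").load("s3://my-bucket/hudi/hudi_demo/*/*")
snapshot.createOrReplaceTempView("hudi_demo_snapshot")
spark.sql("SELECT uuid, ts, payload FROM hudi_demo_snapshot").show(false)
```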

But when I run the same query at the Hive prompt with Spark as the underlying execution engine, it returns all the records and does not filter out the older parquet files.

I have tried setting the property spark.sql.hive.convertMetastoreParquet=false, but it still did not work.

Please help.


Solution

  • This is a known issue in Hudi.

    Still, with the property below I am able to remove the duplicates in RO (read-optimised) Hudi tables. The issue persists for the RT (real-time) table.

    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;