Search code examples
hadoophive

Difference in Results Between Two Hive Queries with and without a Time Partition Condition


I have two Hive queries that I'm running against a table, and I'm getting different results, but I'm not entirely sure why. I'd like to understand the reason for this difference.

Query 1:

SELECT member_id, record_ts, record_type
FROM record
WHERE member_id = "M-01"
ORDER BY record_ts

Query 2:

SELECT member_id, record_ts, record_type
FROM record
WHERE member_id = "M-01"
AND time_partition >= "2023-01"
ORDER BY record_ts

In both queries, I'm selecting records from the table "record" where "member_id" is equal to "M-01", and I'm ordering the results by the "record_ts" column. However, Query 2 includes an additional condition that filters the records based on the "time_partition" column.

My Question: Why does the first query produce an empty result, while the second query returns results? What could be the reason for this difference in results between these two Hive queries?

I'd appreciate any insights or explanations that can help me understand this behavior better. Thank you!

I tried adding hive.fetch.task.conversion=none but that didn't help


Solution

  • The table (record_staging) used to populate the record table uses bucketing_version'='2', this alters the way data is distributed and organized in the record_staging table. So from then on, the record table was being populated using a different bucketing strategy. The record table’s bucketing strategy is based on the member_id column that’s why it impacts queries related to member_id.

    This shift caused records to be stored differently than before, leading to the data issue we’ve observed. To fix this we just need to update the record table to use the same bucketing strategy.

    ALTER TABLE record SET TBLPROPERTIES('bucketing_version'='2');