hadoop hive query-optimization hadoop-yarn

Approach to reduce the execution time of a Hive query

We run this below query daily and this query runs for 3 hours or so, owing due to sheer volume of data in the transaction table. Is there any way we can tune this query or reduce the execution time?

   CREATE TEMPORARY TABLE t1 AS
    SELECT DISTINCT EVENT_DATE FROM (
      SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
      WHERE load_date BETWEEN CAST(CAST('2019-03-05 04:00:31.0' AS TIMESTAMP) AS DATE) AND CURRENT_DATE() AND  event_title = 'SETUP'
      AND state != 'INACTIVE' AND mode != 'DORMANT') T

I tried to reduce the number of reducers to help speed up, and also tried to enable vectorization but not much luck here. We are running on tez.

Solution

You do not need to apply DISTINCT two times
If table mstr_wrk.cust_transation is partitioned by load_date, partition pruning will not work because you are using functions. This will cause table full scan. Calculate dates in the shell script and pass as a parameters

Check this script performance before parametrizing your script

  CREATE TEMPORARY TABLE t1 AS
      SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
      WHERE load_date >= '2019-03-05' AND load_date <= '2019-03-07' 
            AND  event_title = 'SETUP'
            AND state != 'INACTIVE' AND mode != 'DORMANT'