Search code examples
hadoophivequery-optimizationhadoop-yarn

Approach to reduce the execution time of a Hive query


We run this below query daily and this query runs for 3 hours or so, owing due to sheer volume of data in the transaction table. Is there any way we can tune this query or reduce the execution time?

   CREATE TEMPORARY TABLE t1 AS
    SELECT DISTINCT EVENT_DATE FROM (
      SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
      WHERE load_date BETWEEN CAST(CAST('2019-03-05 04:00:31.0' AS TIMESTAMP) AS DATE) AND CURRENT_DATE() AND  event_title = 'SETUP'
      AND state != 'INACTIVE' AND mode != 'DORMANT') T

I tried to reduce the number of reducers to help speed up, and also tried to enable vectorization but not much luck here. We are running on tez.


Solution

    1. You do not need to apply DISTINCT two times
    2. If table mstr_wrk.cust_transation is partitioned by load_date, partition pruning will not work because you are using functions. This will cause table full scan. Calculate dates in the shell script and pass as a parameters

    Check this script performance before parametrizing your script

      CREATE TEMPORARY TABLE t1 AS
          SELECT DISTINCT EVENT_DATE FROM mstr_wrk.cust_transation
          WHERE load_date >= '2019-03-05' AND load_date <= '2019-03-07' 
                AND  event_title = 'SETUP'
                AND state != 'INACTIVE' AND mode != 'DORMANT'