hadoop hive hadoop-yarn hiveql apache-tez

Query taking time despite adding session settings

Following is the ETL generated query

Query -

SELECT infaHiveSysTimestamp('SS') as a0, 7991 as a1, single_use_subq30725.a1 as a2, SUBSTR(SUBSTR(single_use_subq30725.a2, 0, 5), 0, 5) as a3, CAST(1 AS SMALLINT) as a4, single_use_subq30725.a3 as a5, single_use_subq30725.a4 as a6, SUBSTR(SUBSTR(SUBSTR(single_use_subq30725.a8, (CASE WHEN 12 < (- LENGTH(single_use_subq30725.a8)) THEN 0 ELSE 12 END), 104857600), 0, 20), 0, 20) as a7, infaNativeUDFCallString('TO_CHAR', single_use_subq30725.a5) as a8, infaHiveSysTimestamp('SS') as a9, CAST(infaNativeUDFCallDate('TRUNC', single_use_subq30725.a6, 'DD') AS DATE) as a10 FROM (SELECT (CASE WHEN 1 = t1.a1 THEN t1.a0 ELSE CAST(NULL AS TIMESTAMP) END) as a0, infaNativeUDFCallDate('TRUNC', (CASE WHEN 1 = t1.a1 THEN t1.a0 ELSE CAST(NULL AS TIMESTAMP) END), 'DD') as a1 
FROM 
    (
        SELECT MAX(t1.a0) as a0, MAX(t1.a1) as a1 
        FROM (
            SELECT mstr_load_audit.last_run_ts as a0, 1 as a1 FROM mstr_etl.mstr_load_audit WHERE interface_name='m_CTM_RAWTLogData_target_tbl'
            ) t1
        )t1
    ) single_use_subq39991
JOIN (
    SELECT w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.CREATE_TS as a0, CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.txACTION_ID AS STRING) as a1, SUBSTR(SUBSTR(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.STORE_NUM AS DECIMAL(18, 0))), (CASE WHEN 0 < (- LENGTH(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.STORE_NUM AS DECIMAL(18, 0))))) THEN 0 ELSE 0 END), 10), (CASE WHEN 0 < (- LENGTH(SUBSTR(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.STORE_NUM AS DECIMAL(18, 0))), (CASE WHEN 0 < (- LENGTH(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.STORE_NUM AS DECIMAL(18, 0))))) THEN 0 ELSE 0 END), 10))) THEN 0 ELSE 0 END), 10) as a2, SUBSTR(SUBSTR(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.LANE_NUM AS DECIMAL(18, 0))), (CASE WHEN 0 < (- LENGTH(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.LANE_NUM AS DECIMAL(18, 0))))) THEN 0 ELSE 0 END), 10), (CASE WHEN 0 < (- LENGTH(SUBSTR(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.LANE_NUM AS DECIMAL(18, 0))), (CASE WHEN 0 < (- LENGTH(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.LANE_NUM AS DECIMAL(18, 0))))) THEN 0 ELSE 0 END), 10))) THEN 0 ELSE 0 END), 10) as a3, SUBSTR(SUBSTR(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.tx_NUM AS DECIMAL(18, 0))), (CASE WHEN 0 < (- LENGTH(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.tx_NUM AS DECIMAL(18, 0))))) THEN 0 ELSE 0 END), 20), (CASE WHEN 0 < (- LENGTH(SUBSTR(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.tx_NUM AS DECIMAL(18, 0))), (CASE WHEN 0 < (- LENGTH(infaNativeUDFCallString('TO_CHAR', CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.tx_NUM AS DECIMAL(18, 0))))) THEN 0 ELSE 0 END), 20))) THEN 0 ELSE 0 END), 20) as a4, CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.LOYALTY_DEV_NUM AS DECIMAL(28, 0)) as a5, CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.tx_DT AS TIMESTAMP) as a6, CAST(w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.ETL_LOAD_DT AS TIMESTAMP) as a7, w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.tx_TS as a8 
FROM 
sourcedb.W5883634877684653839_Read_lcl_tlog_raw_2_VIEW__m_CTM_RAWTLogData_target_tbl
) 
single_use_subq30725 
WHERE (single_use_subq39991.a0 < single_use_subq30725.a0) AND (single_use_subq39991.a1 <= single_use_subq30725.a7)]

As this query is generated in hive pushdown mode, we have added following settings in the environment sql

SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
set hive.exec.orc.split.strategy=BI;
set hive.merge.tezfiles=true;

But we did not see any significant gains. We do have an option of moving this job in traditional batch mode - where we run this via shell script. Is there any scope of making any changes to the query there by reducing the execution time. I am sure we can get rid of all type conversions and reduce execution time there. Are there any additional things, we can try.

Solution

This join in your query:

JOIN (
    SELECT 
 ... SKIPPED ...
) 
single_use_subq30725 
WHERE (single_use_subq39991.a0 < single_use_subq30725.a0) AND (single_use_subq39991.a1 <= single_use_subq30725.a7)]

works as CROSS JOIN because no ON condition specified. After this CROSS JOIN, the dataset is being filtered using this WHERE (single_use_subq39991.a0 < single_use_subq30725.a0) AND (single_use_subq39991.a1 <= single_use_subq30725.a7)

Actually it does not multiply rows and should work as MAP-JOIN because first subquery returns one row maximum:

SELECT MAX(t1.a0) as a0, MAX(t1.a1) as a1

Add this setting to enable map-join: set hive.auto.convert.join=true; Check map-join is in the EXPLAIN output.

But the biggest problem is not this CROSS (MAP?) join itself. It prevents predicate push-down to work before join, when reading table in second query.

I suggest to remove join at all and calculate first query once and provide a0 and a1 as a parameters in the where clause. In such way you will eliminate unnnecessary join and predicate push-down may work directly.

For example PPD could be applied to this column: w5883634877684653839_read_lcl_tlog_raw_2_view__m_ctm_rawtlogdata_target_tbl.CREATE_TS as a0

Check PPD and other performance settings: https://stackoverflow.com/a/48296562/2700344