performance hadoop hive query-tuning apache-tez

hive analyze query taking lot of time

In order to speed up ETL queries on large tables, we run many analyze queries on these tables and date columns in the evening. but these analyze queries on columns take lot of memory and time. we are using tez. is there any way to optimize analyze query also like some set commands.

Solution

If you are loading tables using insert overwrite then statistics can be gathered automatically by setting hive.stats.autogather=true during insert overwrite queries.

If the table is partitioned and partitions are being loaded incrementally, then you can analyze only last partitions.

ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]

See examples here: https://cwiki.apache.org/confluence/display/Hive/StatsDev

For ORC files it's possible to specify hive.stats.gather.num.threads to incraase parallelism.

See full list of statistic settings here: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics