Search code examples
performancehadoophivequery-tuningapache-tez

hive analyze query taking lot of time


In order to speed up ETL queries on large tables, we run many analyze queries on these tables and date columns in the evening. but these analyze queries on columns take lot of memory and time. we are using tez. is there any way to optimize analyze query also like some set commands.


Solution

  • If you are loading tables using insert overwrite then statistics can be gathered automatically by setting hive.stats.autogather=true during insert overwrite queries.

    If the table is partitioned and partitions are being loaded incrementally, then you can analyze only last partitions.

    ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] 
    

    See examples here: https://cwiki.apache.org/confluence/display/Hive/StatsDev

    For ORC files it's possible to specify hive.stats.gather.num.threads to incraase parallelism.

    See full list of statistic settings here: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics