Search code examples
hadoophivetez

Bucketing not working in hive


I have bucketed columns and even after setting all the parameter I am not getting any performance benefit. Below is the query I am using and the buckets I have created, I have added explain plan results as well.

select count(*) from bigtable_main a inner join 
big_cnt10000 b where a.srrecordid = b.srrecordid;
---112 seconds....

ALTER TABLE bigtable_main CLUSTERED BY(srrecordid) SORTED BY(srrecordid) INTO 40 BUCKETS ;
ALTER TABLE big_cnt10000 CLUSTERED BY(srrecordid) SORTED BY(srrecordid) INTO 40 BUCKETS ;

---112 seconds....
---------------------------------------------------
SET hive.enforce.bucketing=true;
SET hive.optimize.bucketmapjoin=true;
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

even the explain plan is same. Any idea?
Vertex dependency in root stage
Map 1 <- Map 3 (BROADCAST_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 2
         File Output Operator [FS_13]
            compressed:false
            Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Group By Operator [GBY_11]
            |  aggregations:["count(VALUE._col0)"]
            |  outputColumnNames:["_col0"]
            |  Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            |<-Map 1 [SIMPLE_EDGE]
               Reduce Output Operator [RS_10]
                  sort order:
                  Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  value expressions:_col0 (type: bigint)
                  Group By Operator [GBY_9]
                     aggregations:["count()"]
                     outputColumnNames:["_col0"]
                     Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                     Select Operator [SEL_8]
                        Statistics:Num rows: 31669970 Data size: 3166997036 Basic stats: COMPLETE Column stats: NONE
                        Filter Operator [FIL_16]
                           predicate:(_col0 = _col11) (type: boolean)
                           Statistics:Num rows: 31669970 Data size: 3166997036 Basic stats: COMPLETE Column stats: NONE
                           Map Join Operator [MAPJOIN_19]
                           |  condition map:[{"":"Inner Join 0 to 1"}]
                           |  HybridGraceHashJoin:true
                           |  keys:{"Map 3":"srrecordid (type: string)","Map 1":"srrecordid (type: string)"}
                           |  outputColumnNames:["_col0","_col11"]
                           |  Statistics:Num rows: 63339940 Data size: 6333994073 Basic stats: COMPLETE Column stats: NONE
                           |<-Map 3 [BROADCAST_EDGE]
                           |  Reduce Output Operator [RS_5]
                           |     key expressions:srrecordid (type: string)
                           |     Map-reduce partition columns:srrecordid (type: string)
                           |     sort order:+
                           |     Statistics:Num rows: 42529 Data size: 4252905 Basic stats: COMPLETE Column stats: NONE
                           |     Filter Operator [FIL_18]
                           |        predicate:srrecordid is not null (type: boolean)
                           |        Statistics:Num rows: 42529 Data size: 4252905 Basic stats: COMPLETE Column stats: NONE
                           |        TableScan [TS_1]
                           |           alias:b
                           |           Statistics:Num rows: 85058 Data size: 8505810 Basic stats: COMPLETE Column stats: NONE
                           |<-Filter Operator [FIL_17]
                                 predicate:srrecordid is not null (type: boolean)
                                 Statistics:Num rows: 57581763 Data size: 5758176306 Basic stats: COMPLETE Column stats: NONE
                                 TableScan [TS_0]
                                    alias:a
                                    Statistics:Num rows: 115163525 Data size: 11516352512 Basic stats: COMPLETE Column stats: NONE

Solution

  • Hive compiler needs the metadata and Meta information decides the execution plan. doc

    The compiler needs the metadata so send a request for getMetaData and receives the sendMetaData request from MetaStore.

    This metadata is used to typecheck the expressions in the query tree as well as to prune partitions based on query predicates. The plan generated by the compiler is a DAG of stages with each stage being either a map/reduce job, a metadata operation or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducer

    Alter storage statements change the table's physical storage properties but not the meta.

    To use of proper bucket drop and create table .

    below is the link for detail.

    NOTE: These commands will only modify Hive's metadata, and will NOT reorganize or reformat existing data. Users should make sure the actual data layout conforms with the metadata definition.