I have bucketed columns and even after setting all the parameter I am not getting any performance benefit. Below is the query I am using and the buckets I have created, I have added explain plan results as well.
select count(*) from bigtable_main a inner join
big_cnt10000 b where a.srrecordid = b.srrecordid;
---112 seconds....
ALTER TABLE bigtable_main CLUSTERED BY(srrecordid) SORTED BY(srrecordid) INTO 40 BUCKETS ;
ALTER TABLE big_cnt10000 CLUSTERED BY(srrecordid) SORTED BY(srrecordid) INTO 40 BUCKETS ;
---112 seconds....
---------------------------------------------------
SET hive.enforce.bucketing=true;
SET hive.optimize.bucketmapjoin=true;
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
even the explain plan is same. Any idea?
Vertex dependency in root stage
Map 1 <- Map 3 (BROADCAST_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_13]
compressed:false
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [GBY_11]
| aggregations:["count(VALUE._col0)"]
| outputColumnNames:["_col0"]
| Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_10]
sort order:
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
value expressions:_col0 (type: bigint)
Group By Operator [GBY_9]
aggregations:["count()"]
outputColumnNames:["_col0"]
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Select Operator [SEL_8]
Statistics:Num rows: 31669970 Data size: 3166997036 Basic stats: COMPLETE Column stats: NONE
Filter Operator [FIL_16]
predicate:(_col0 = _col11) (type: boolean)
Statistics:Num rows: 31669970 Data size: 3166997036 Basic stats: COMPLETE Column stats: NONE
Map Join Operator [MAPJOIN_19]
| condition map:[{"":"Inner Join 0 to 1"}]
| HybridGraceHashJoin:true
| keys:{"Map 3":"srrecordid (type: string)","Map 1":"srrecordid (type: string)"}
| outputColumnNames:["_col0","_col11"]
| Statistics:Num rows: 63339940 Data size: 6333994073 Basic stats: COMPLETE Column stats: NONE
|<-Map 3 [BROADCAST_EDGE]
| Reduce Output Operator [RS_5]
| key expressions:srrecordid (type: string)
| Map-reduce partition columns:srrecordid (type: string)
| sort order:+
| Statistics:Num rows: 42529 Data size: 4252905 Basic stats: COMPLETE Column stats: NONE
| Filter Operator [FIL_18]
| predicate:srrecordid is not null (type: boolean)
| Statistics:Num rows: 42529 Data size: 4252905 Basic stats: COMPLETE Column stats: NONE
| TableScan [TS_1]
| alias:b
| Statistics:Num rows: 85058 Data size: 8505810 Basic stats: COMPLETE Column stats: NONE
|<-Filter Operator [FIL_17]
predicate:srrecordid is not null (type: boolean)
Statistics:Num rows: 57581763 Data size: 5758176306 Basic stats: COMPLETE Column stats: NONE
TableScan [TS_0]
alias:a
Statistics:Num rows: 115163525 Data size: 11516352512 Basic stats: COMPLETE Column stats: NONE
Hive compiler needs the metadata and Meta information decides the execution plan. doc
The compiler needs the metadata so send a request for getMetaData and receives the sendMetaData request from MetaStore.
This metadata is used to typecheck the expressions in the query tree as well as to prune partitions based on query predicates. The plan generated by the compiler is a DAG of stages with each stage being either a map/reduce job, a metadata operation or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducer
Alter storage statements change the table's physical storage properties but not the meta.
To use of proper bucket drop and create table .
below is the link for detail.