I have hive table with 2 partitions and the 1st partition is the city and the second partition is village, So every city partition will contains the list of all village partitions in it.Some thing like below
city1/village1
city1/village2
city1/village3
city2/village5
city2/village6
So if my select statement is select * from table where village = 'village5'
will it search all the partitions in city 1 and city 2 before outputting the result? Or will it see the hive metastore file and hit only the village5 partition.
It will depend on your Hive version how optimize it is. In my current version (1.1.0) Hive is able to point to the specific partition without scanning the top partition
Here is a quick demonstration.
create table mydb.partition_test
(id string)
partitioned by (city string, village string);
INSERT OVERWRITE TABLE mydb.partition_test PARTITION (city,village)
select * from (
select '1', 'city1', 'village1'
union all
select '1', 'city1', 'village2'
union all
select '1', 'city1', 'village3'
union all
select '1', 'city2', 'village5'
union all
select '1', 'city2', 'village6'
) t;
explain select * from mydb.partition_test where village='village5';
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: partition_test
filterExpr: (village = 'village5') (type: boolean)
Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: id (type: string), city (type: string), 'village5' (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
ListSink
As you can see from the execution plan, it is able to estimate the number of records for that specific partition without mapred operation and the table scan is pointing to the specific partition.