Search code examples
hivemapreducehiveqlhadoop2

Hive multilevel partitions and select with where clause


I have hive table with 2 partitions and the 1st partition is the city and the second partition is village, So every city partition will contains the list of all village partitions in it.Some thing like below

city1/village1
city1/village2
city1/village3
city2/village5
city2/village6

So if my select statement is select * from table where village = 'village5' will it search all the partitions in city 1 and city 2 before outputting the result? Or will it see the hive metastore file and hit only the village5 partition.


Solution

  • It will depend on your Hive version how optimize it is. In my current version (1.1.0) Hive is able to point to the specific partition without scanning the top partition

    Here is a quick demonstration.

    create table mydb.partition_test 
    (id string)
    partitioned by (city string, village string);
    
    INSERT OVERWRITE TABLE mydb.partition_test PARTITION (city,village)
    select * from (
    
    select '1', 'city1', 'village1'
    union all 
    select '1', 'city1', 'village2'
    union all 
    select '1', 'city1', 'village3'
    union all 
    select '1', 'city2', 'village5'
    union all 
    select '1', 'city2', 'village6'
    ) t;
    
    explain select * from mydb.partition_test where village='village5';
    
    STAGE DEPENDENCIES:
      Stage-0 is a root stage
    
    STAGE PLANS:
      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            TableScan
              alias: partition_test
              filterExpr: (village = 'village5') (type: boolean)
              Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
              Select Operator
                expressions: id (type: string), city (type: string), 'village5' (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
                ListSink
    

    As you can see from the execution plan, it is able to estimate the number of records for that specific partition without mapred operation and the table scan is pointing to the specific partition.