Search code examples
sqlhivehive-partitions

Insert overwrite on partitioned table is not deleting the existing data


I am trying to run insert overwrite over a partitioned table. The select query of insert overwrite omits one partition completely. Is it the expected behavior?

Table definition

CREATE TABLE `cities_red`(                                                              
  `cityid` int,                                                                         
  `city` string)                                                                        
PARTITIONED BY (                                                                        
  `state` string)                                                                       
ROW FORMAT SERDE                                                                        
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                                           
STORED AS INPUTFORMAT                                                                   
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                                     
OUTPUTFORMAT                                                                            
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'                                    
TBLPROPERTIES (                                                                         
  'auto.purge'='true',                                                                  
  'last_modified_time'='1555591782',                                                    
  'transient_lastDdlTime'='1555591782');  

Table Data

+--------------------+------------------+-------------------+--+
| cities_red.cityid  | cities_red.city  | cities_red.state  |
+--------------------+------------------+-------------------+--+
| 13                 | KARNAL           | HARYANA           |
| 13                 | KARNAL           | HARYANA           |
| 1                  | Nagpur           | MH                |
| 22                 | Mumbai           | MH                |
| 22                 | Mumbai           | MH                |
| 755                | BPL              | MP                |
| 755                | BPL              | MP                |
| 10                 | BANGLORE         | TN                |
| 10                 | BANGLORE         | TN                |
| 10                 | BANGLORE         | TN                |
| 10                 | BANGLORE         | TN                |
| 12                 | NOIDA            | UP                |
| 12                 | NOIDA            | UP                |
+--------------------+------------------+-------------------+--+

Queries

insert overwrite table cities_red partition (state) select * from cities_red where city !='NOIDA';

It does not delete any data from the table

insert overwrite table cities_red partition (state) select * from cities_red where city !='Mumbai';

It removes the expected 2 rows from the table.

Is this an expected behavior from Hive in case of partitioned tables?


Solution

  • Yes, this is expected behavior.

    Insert overwrite table partition select ,,, overwrites only partitions existing in the dataset returned by select.

    In your example partition state=UP has records with city='NOIDA' only. Filter where city !='NOIDA' removes entire state=UP partition from the returned dataset and this is why it is not being rewritten.

    Filter city !='Mumbai' does not filter entire partition, it is partially returned, this is why it is being overwritten with filtered data.

    It works as designed. Consider scenario when you need to overwrite only desired partitions, this is quite normal for the incremental partition load. You do not need to touch other partitions in this case. You need to be able normally to overwrite only desired partitions. And without overwriting unchanged partitions, which can be very expensive to recover.

    And if you still want to drop partitions and modify data in existing partitions, then you can drop/create table (you may need to create one more intermediate table for this) and then load partitions into it. Or alternatively calculate partitions which you need to drop separately and execute ALTER TABLE DROP PARTITION.