Search code examples
hadoophivehiveqlhadoop2hadoop-partitioning

Hive Managed vs External tables maintainability


Which one is better (performance wise and operation on the long run) in maintaining data loaded, managed or external?

And by maintaining, i mean that these tables will have the following operations on daily basis frequently;

  1. Select using partitions most of the time.. but for some of it they are not used.
  2. Delete specific records, not all the partition (for example found a problem in some columns and want to delete and insert it again). - i am not sure if this supported for normal tables, unless transactional is used.
  3. Most important, The need to merge files frequently.. may be twice a day to merge small files to gain less mappers. I know concate is available on managed and insert overwrite on external.. which one is less cost?

Solution

  • It depends on your use case. External table is recommended when they are used across multiple application for example Along with hive pig or other application is also used for processing the data in this kind of scenario external tables are mainly recommended.They are used when you are mainly reading data.

    While in case of managed tables hive have complete control over the data. Though you can convert any external table to managed and vice versa

    alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
    

    As in your case you are doing frequent modifications in data so it is better that hive should have total control over the data. In this scenraio it is recommended to use Managed tables.

    Apart from that managed table are more secure then external table because external table can be accessed by anyone. While in managed table you can implement hive level security which provided better control but in case of external you will have to implement HDFS level security.

    You can refer the below links which can give you few pointers in considerations

    External Vs Managed tables comparison