Search code examples
sqlhivehiveqlhive-partitionshiveddl

How do I partition a table by all values?


I have an external table, now I want to add partitions to it. I have 224 unique city id's and I want to just write alter table my_table add partition (cityid) location /path; but hive complains, saying that I don't provide anything for the city id value, it should be e.g. alter table my_table add partition (cityid=VALUE) location /path;, but I don't want to run alter table commands for every value of city id, how can I do it for all id's in one go?

This is what hive command line looks like:

hive> alter table pavel.browserdata add partition (cityid) location '/user/maria_dev/data/cityidPartition';                                                                                                                                                                                                           

FAILED: ValidationFailureSemanticException table is not partitioned but partition spec exists: {cityid=null}


Solution

  • Partition on physical level is a location (separate location for each value, usually looks like key=value) with data files. If you already have partitions directory structure with files, all you need is to create partitions in Hive metastore, then you can point your table to the root directory using ALTER TABLE SET LOCATION, then use MSCK REPAIR TABLE command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is: ALTER TABLE table_name RECOVER PARTITIONS. This will add Hive partitions metadata. See manual here: RECOVER PARTITIONS

    If you have only not-partitioned table with data in it's location, then adding partitions will not work because the data needs to be reloaded, you need to:

    Create another partitioned table and use insert overwrite to load partition data using dynamic partition load:

    set hive.exec.dynamic.partition=true;   
    set hive.exec.dynamic.partition.mode=nonstrict; 
    
    insert overwrite table2 partition(cityid) 
    select col1, ... colN,
           cityid    
      from table1; --partitions columns should be last in the select
    

    This is quite efficient way to reorganize your data.

    After this you can delete source table and rename your target table.