Search code examples
hivehiveqlpartitionhive-partitionshiveddl

Changing the partition spec of a hive table and move data


I have a external hive table employee which is partitioned by extract_timestamp (yyyy-mm-dd hh:mm:ss) as below.

empid   empname     extract_time
1       abc         2019-05-17 00:00:00
2       def         2019-05-18 14:21:00

I am trying to remove the partition by extract_time and change it to year,month and day partition. I am following the below method for this.

1. Create a new table employee_new with partitions year month and day

 create external table employee_new
(empid int,
 empname string
)
partitioned by (year int,month int,day int)
location '/user/emp/data/employee_new.txt';

2. insert overwrite into employee_new by selecting data from employee table

insert overwrite into employee_new as select*,year(extract_time),month(extract_time)
,day(extract_time) 
from employee

3. Drop employee and employee_new and create employee table on top of /user/emp/data/employee_new.txt

Please let me know if this method is efficient and if there are any better ways to do the same.


Solution

  • Partition by date yyyy-MM-dd only, if possible, if upstream process can write hour files to daily folders. For such a small table partitioning by year, month and day separately seems overkill. It will be still too many folders. If table is partitioned by date yyyy-MM-dd, partition pruning will work for your usage scenario because you are querying by day or year or month.

    To filter by year in this case you will provide

    where date >= '2019-01-01' and date < '2020-01-01' condition,

    to filter by month:

    where date >= '2019-01-01' and date < '2020-02-01'

    and day: where date = '2019-01-01'

    Filesystem listing will work much faster.

    And if it is not possible to redesign upstream process to write to yyyy-MM-dd folders then your new design as you described in the question (yyyy/MM/dd folders) is the only solution.