I have a external hive table employee which is partitioned by extract_timestamp (yyyy-mm-dd hh:mm:ss) as below.
empid empname extract_time
1 abc 2019-05-17 00:00:00
2 def 2019-05-18 14:21:00
I am trying to remove the partition by extract_time and change it to year,month and day partition. I am following the below method for this.
1. Create a new table employee_new with partitions year month and day
create external table employee_new
(empid int,
empname string
)
partitioned by (year int,month int,day int)
location '/user/emp/data/employee_new.txt';
2. insert overwrite into employee_new by selecting data from employee table
insert overwrite into employee_new as select*,year(extract_time),month(extract_time)
,day(extract_time)
from employee
3. Drop employee and employee_new and create employee table on top of /user/emp/data/employee_new.txt
Please let me know if this method is efficient and if there are any better ways to do the same.
Partition by date yyyy-MM-dd
only, if possible, if upstream process can write hour files to daily folders. For such a small table partitioning by year, month and day separately seems overkill. It will be still too many folders.
If table is partitioned by date yyyy-MM-dd, partition pruning will work for your usage scenario because you are querying by day or year or month.
To filter by year in this case you will provide
where date >= '2019-01-01' and date < '2020-01-01'
condition,
to filter by month:
where date >= '2019-01-01' and date < '2020-02-01'
and day: where date = '2019-01-01'
Filesystem listing will work much faster.
And if it is not possible to redesign upstream process to write to yyyy-MM-dd folders then your new design as you described in the question (yyyy/MM/dd folders) is the only solution.