Search code examples
hadoophivehadoop-partitioning

Hive Partitioned by Date -- Processing multiple dates at a time?


I might have a gap in understanding hive partitioning. I have an external table that is partitioned by date. I'm generating the parquet files via a query on a managed hive table. I currently run a bash script to process incrementally by date (which gets supplied to the query). Is there a better way to process multiple dates?

Table is created this way:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

CREATE EXTERNAL TABLE my_table (id int, product string)
PARTITIONED BY (dt string)
LOCATION '/path/to/location'

My bash script essentially iterates dates and inserts data via:

INSERT OVERWRITE TABLE my_table (dt='2020-07-09')
SELECT id, product FROM managed_table WHERE dt = '2020-07-09';

So this works, but it's only doing it a day at a time which doesn't seem very efficient. Is there a better way?


Solution

  • You are inserting record using static partition. In order to do dynamic partition, while INSERTING data, you can define the dynamic partition as below. You would also need to SET the hive dynamic partition rules. This is required for enabling dynamic partition.

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    
    INSERT OVERWRITE INTO TABLE my_table PARTITION(dt)
    SELECT 
     id, 
     product,
     dt
    FROM managed_table;
    

    More on dynamic partition in hive.