I have an existing bucketed table partitioned by YEAR, MONTH, DAY, but I want to add partitioning by INGESTION_KEY, a column that doesn't exist in the existing table. This is to accommodate future table inserts, so that I don't have to OVERWRITE a YEAR, MONTH, DAY partition every time I ingest data for that date; I can just do a simple INSERT INTO and create a new INGESTION_KEY partition.
I need a year's worth of data in my new table to start, so I want to copy a year of partitions from the existing table to the new one. Rather than doing a Hive INSERT for each partition, I thought it would be quicker to use distcp to copy the files into the new table's partition directories under the Hive warehouse directory in HDFS, then ADD PARTITION to the new table.
So, this is all I'm doing:
hadoop distcp /apps/hive/warehouse/src_db.db/src_tbl/year=2017/month=02/day=06 /apps/hive/warehouse/dest_db.db/dest_tbl/year=2017/month=02/day=06/ingestion_key=123
hive -e "ALTER TABLE dest_tbl ADD PARTITION (year=2017,month=02,day=06,ingestion_key='123')"
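To backfill the full year without typing each pair of commands by hand, the two commands above can be generated per date with a small script (a sketch; partition_commands is a hypothetical helper, and the paths, table names, start date, and ingestion key are assumptions taken from the example above):

```python
from datetime import date, timedelta

def partition_commands(day, ingestion_key="123",
                       src="/apps/hive/warehouse/src_db.db/src_tbl",
                       dest="/apps/hive/warehouse/dest_db.db/dest_tbl"):
    """Build the distcp / ADD PARTITION command pair for one day's partition."""
    part = f"year={day.year}/month={day.month:02d}/day={day.day:02d}"
    copy_cmd = f"hadoop distcp {src}/{part} {dest}/{part}/ingestion_key={ingestion_key}"
    add_cmd = ('hive -e "ALTER TABLE dest_tbl ADD PARTITION '
               f"(year={day.year},month={day.month:02d},day={day.day:02d},"
               f"ingestion_key='{ingestion_key}')\"")
    return copy_cmd, add_cmd

# Emit the command pair for every day in the year being backfilled
# (the start date here is an assumption for illustration).
start = date(2016, 2, 7)
for offset in range(366):
    copy_cmd, add_cmd = partition_commands(start + timedelta(days=offset))
    print(copy_cmd)
    print(add_cmd)
```

The generated lines can be reviewed and then piped to a shell, which keeps the per-partition copy and metadata steps in lockstep.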
Both are managed tables, the new table dest_tbl is clustered by the same column into the same number of buckets as src_tbl, and the only difference in schema is the addition of INGESTION_KEY.
So far, SELECT * FROM dest_tbl shows everything in the new table looking normal. So my question is: is there anything wrong with this approach? Is it bad to load a managed, bucketed table this way, or is this an acceptable alternative to INSERT if no transformations are being done on the copied data?
Thanks!!
Although I prefer copying via a Hive query just to keep it all in Hive, it's OK to copy the data files with other tools. But note that there is a dedicated command that adds the new partitions' metadata; you can use it in place of ALTER TABLE ... ADD PARTITION, and it can add many partitions at once:
MSCK REPAIR TABLE dest_tbl;
Keep using Hive's default partitioning format: partitionKey=partitionValue
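For illustration, a small helper (hypothetical, not part of Hive) shows the partitionKey=partitionValue directory layout that MSCK REPAIR TABLE relies on to discover partitions on disk:

```python
def partition_path(table_root, parts):
    """Join a table's root directory with partitionKey=partitionValue
    segments, Hive's default partition directory layout."""
    return table_root + "".join(f"/{key}={value}" for key, value in parts)

# The destination partition from the question, in the layout MSCK expects.
print(partition_path("/apps/hive/warehouse/dest_db.db/dest_tbl",
                     [("year", "2017"), ("month", "02"),
                      ("day", "06"), ("ingestion_key", "123")]))
```

If the copied directories don't follow this naming, MSCK REPAIR TABLE won't pick them up and you're back to adding each partition's metadata manually.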