Here is my scenario:
I load data from one of the folders in S3 bucket. Let's call this bucket 'new_data'.
CREATE EXTERNAL TABLE new_data (col1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://some-bucket/folder'
Next, I load data from another folder in S3, let's call this 'data_already_imported'.
CREATE EXTERNAL TABLE data_already_imported (col1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://some-bucket/another-folder'
Then, I find the difference between new_data and data_already_imported
CREATE TABLE difference AS SELECT * FROM ( SELECT n.* FROM new_data n LEFT OUTER JOIN data_already_imported old ON n.col1 = old.col1 WHERE old.col1 IS NULL ) i;
Now, I want to move this 'difference' into data_already_imported folder. Currently I have
INSERT INTO TABLE data_already_imported
SELECT * FROM difference;
The problem is this overwrites the data that was already there in data_already_imported. NOTE: I have tried it with INSERT OVERWRITE
as well. Could someone please point me what am I doing wrong here?
In Hive, there no such a thing as appending records to an already populated table. You have to create another table (let's say it will be called new_data_already_imported
) then do something like:
INSERT OVERWRITE TABLE new_data_already_imported
SELECT * FROM (
SELECT * FROM data_already_imported UNION ALL
SELECT * FROM difference );