amazon-web-services amazon-s3 hive insert

Hive S3 data insert keeps overwriting the data

Here is my scenario:

I load data from one of the folders in S3 bucket. Let's call this bucket 'new_data'.

CREATE EXTERNAL TABLE new_data (col1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://some-bucket/folder'
Next, I load data from another folder in S3, let's call this 'data_already_imported'.

CREATE EXTERNAL TABLE data_already_imported (col1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://some-bucket/another-folder'
Then, I find the difference between new_data and data_already_imported

CREATE TABLE difference AS SELECT * FROM ( SELECT n.* FROM new_data n LEFT OUTER JOIN data_already_imported old ON n.col1 = old.col1 WHERE old.col1 IS NULL ) i;

Now, I want to move this 'difference' into data_already_imported folder. Currently I have

INSERT INTO TABLE data_already_imported 
SELECT * FROM difference;

The problem is this overwrites the data that was already there in data_already_imported. NOTE: I have tried it with INSERT OVERWRITE as well. Could someone please point me what am I doing wrong here?

Solution

In Hive, there no such a thing as appending records to an already populated table. You have to create another table (let's say it will be called new_data_already_imported) then do something like:

INSERT OVERWRITE TABLE new_data_already_imported
  SELECT * FROM (
    SELECT * FROM data_already_imported UNION ALL
    SELECT * FROM difference );