Search code examples
amazon-web-servicesamazon-s3hiveinsert

Hive S3 data insert keeps overwriting the data


Here is my scenario:

  1. I load data from one of the folders in S3 bucket. Let's call this bucket 'new_data'.

    CREATE EXTERNAL TABLE new_data (col1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://some-bucket/folder'

  2. Next, I load data from another folder in S3, let's call this 'data_already_imported'.

    CREATE EXTERNAL TABLE data_already_imported (col1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://some-bucket/another-folder'

  3. Then, I find the difference between new_data and data_already_imported

    CREATE TABLE difference AS SELECT * FROM ( SELECT n.* FROM new_data n LEFT OUTER JOIN data_already_imported old ON n.col1 = old.col1 WHERE old.col1 IS NULL ) i;

Now, I want to move this 'difference' into data_already_imported folder. Currently I have

INSERT INTO TABLE data_already_imported 
SELECT * FROM difference; 

The problem is this overwrites the data that was already there in data_already_imported. NOTE: I have tried it with INSERT OVERWRITE as well. Could someone please point me what am I doing wrong here?


Solution

  • In Hive, there no such a thing as appending records to an already populated table. You have to create another table (let's say it will be called new_data_already_imported) then do something like:

    INSERT OVERWRITE TABLE new_data_already_imported
      SELECT * FROM (
        SELECT * FROM data_already_imported UNION ALL
        SELECT * FROM difference );