Search code examples
hadoophivebucket

Can in insert data multiple times into a bucketed hive table


I have a bucketed hive table. It has 4 buckets.

CREATE TABLE user(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
CLUSTERED BY(user_id) INTO 4 BUCKETS;

Initially i have inserted some records into this table using the following query.

set hive.enforce.bucketing = true;
insert into user
select * from second_user;

After this operation In HDFS I see that 4 files are created under this table dir.

Again i needed to insert another set of data into user table. So i ran the below query.

set hive.enforce.bucketing = true;
insert into user
select * from third_user;

Now another 4 files are crated under user folder dir. Now it has total 8 files.

Is this fine to do this kind of multiple inserts into a bucketed table? Does it affect the bucketing of the table?


Solution

  • I figured it out!! Actually if you do multiple inserts on a bucketed hive table. Hive wont complain as such. All hive queries will work fine.

    Having said that, Such operation spoils the bucketing concept of the table. I mean after multiple inserts into a bucketed table the sampling fails.

    The TABLASAMPLE doesnt work properly after multiple inserts.

    Even sort merge bucket map join also doesnt work after such operation.