Search code examples
hadoophivehiveqlbuckethadoop-partitioning

Can I create buckets in a Hive External Table?


I am creating an external table that refers to ORC files in an HDFS location. That ORC files are stored in such a way that the external table is partitioned by date (Mapping to date wise folders on HDFS, as partitions). However, I am wondering if I can enforce 'Bucketing' on these external tables because the underlying data/files are not 'managed' by hive. They are written externally and hence can bucketing be used in Hive External Tables?

Hive is allowing me to use the 'CLUSTERED BY' clause while creating an external table. But I am not able to understand how hive will redistribute the data into buckets, what is already written on HDFS as ORC files?

I have seen similar questions on PARTITION AND BUCKETING in External tables here:

Hive: Does hive support partitioning and bucketing while usiing external tables

and

Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT....." in Hive?

but the answers are talking only about Partition support in external tables or bucket support in MANAGED tables. I am aware of both those options and am already using it but need specific answers about bucketing support in Hive EXTERNAL tables.

So, In summary, Do Hive External Tables support bucketing? If yes, how is the data in the external folder redistributed into buckets by hive?


Solution

  • Yes, Hive does support bucketing and partitioning for external tables.

    Just try it:

    SET hive.tez.bucket.pruning=true;
    SET hive.optimize.sort.dynamic.partition=true;
    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.enforce.bucketing = true;
    
    drop table stg.test_v1;
    create external table stg.test_v1 
    (
           id bigint
          ,name string
    )
    partitioned by (created_date string)
    CLUSTERED BY(name) INTO 3 BUCKETS
    stored as ORC
    ;
    
    INSERT OVERWRITE TABLE stg.test_v1 partition(created_date)
    SELECT 
          id, name, created_date
    FROM
    (select stack(3, 
                 1, 'Valeriy', '2020-01-01',
                 2, 'Victor', '2020-01-01',
                 3, 'Ankit', '2020-01-01'
                 ) as(id, name, created_date)
    )s;
    

    DESC FORMATTED says:

    Table Type:          EXTERNAL_TABLE 
    ...
    Num Buckets:         3                     
    Bucket Columns:      [name] 
    

    Load more rows and you will see, it will create 3 files per partition

    See also this documentation for more details about features supported for Managed and External tables: Managed vs External Tables.