Search code examples
hadoophivehiveqlbuckethadoop-partitioning

Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT....." in Hive?


I am trying to create a table in Hive

CREATE TABLE BUCKET_TABLE AS 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll
CLUSTERED BY (key) INTO 1000 BUCKETS;

This syntax is failing - but I am not sure if it is even possible to do this combined statement. Any ideas?


Solution

  • Came across this question and saw there was no answer provided. I looked further and found the answer in the Hive documentation.

    This will never work, because of the following restrictions on CTAS:

    1. The target table cannot be a partitioned table.
    2. The target table cannot be an external table.
    3. The target table cannot be a list bucketing table.

    Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect%28CTAS

    Furthermore https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

    CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
    ...
    [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
    ...
    [AS select_statement];

    Clustering requires the column to be defined and then the cfg goes to the As select_statement Therefore at this time it is not possible.

    Optionally, you can ALTER the table and add buckets, but this does not change existing data.

    CREATE TABLE BUCKET_TABLE 
    STORED AS ORC AS 
    SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll limit 0;
    ALTER TABLE BUCKET_TABLE CLUSTERED BY (key) INTO 1000 BUCKETS;
    ALTER TABLE BUCKET_TABLE SET TBLPROPERTIES ('transactional'='true');
    INSERT INTO BUCKET_TABLE 
    SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll;