Search code examples
hiveorcacid

Is it possible to convert a hive table format to ORC and make it bucketed


I have a set of hive tables that are not in ORC format and also not bucketed. I want to change their formats to ORC as well as make them bucketed. Couldn't find a concrete answer throughout the net. Any answer or guidance is appreciated. Hive version is 2.3.5

Or if it is possible to do it in spark (pyspark or scala)?

The simplest solution would be to create a new table which is bucketed and is in ORC format then insert into it from the old table. Looking for an in-place solution.


Solution

  • Create bucketed table and load data into it using INSERT OVERWRITE:

    CREATE TABLE table_bucketed(col1 string, col2 string)
    CLUSTERED BY(col1) INTO 10 BUCKETS
    STORED AS ORC;
    
    INSERT OVERWRITE TABLE table_bucketed
    select ...
      from table_not_bucketed
    

    See also Sorted Bucketed Table.