I have a set of hive tables that are not in ORC format and also not bucketed. I want to change their formats to ORC as well as make them bucketed. Couldn't find a concrete answer throughout the net. Any answer or guidance is appreciated. Hive version is 2.3.5
Or if it is possible to do it in spark (pyspark or scala)?
The simplest solution would be to create a new table which is bucketed and is in ORC format then insert into it from the old table. Looking for an in-place solution.
Create bucketed table and load data into it using INSERT OVERWRITE:
CREATE TABLE table_bucketed(col1 string, col2 string)
CLUSTERED BY(col1) INTO 10 BUCKETS
STORED AS ORC;
INSERT OVERWRITE TABLE table_bucketed
select ...
from table_not_bucketed
See also Sorted Bucketed Table.