As far as I understand, ORC files are supposed to be smaller and better compressed than Parquet files. However, when I populate my ORC table in Apache Hive by selecting rows from my Parquet table, the ORC table ends up consuming about seven times more disk space.
Any ideas why this happens? My table schema is as follows. It contains an array of 200,000 integers taking the values 0, 1, and 2, and each partition has about 10,000 rows.
CREATE TABLE orc_table (
field1 STRING
, field2 INT
, field3 STRING
, field4 STRING
, array_field ARRAY<INT>
) PARTITIONED BY (
partition_name STRING
);
ALTER TABLE orc_table ADD PARTITION (partition_name='<partition_name>');
ALTER TABLE orc_table PARTITION (partition_name='<partition_name>') SET FILEFORMAT ORC;
INSERT INTO TABLE orc_table PARTITION (partition_name='<partition_name>')
SELECT field1, field2, field3, field4, array_field FROM parquet_table
WHERE partition_name='<partition_name>';
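One way to verify the size difference, assuming the tables live under the default Hive warehouse path (adjust the paths for your layout):
-- Hive reports the partition size as "totalSize" under Partition Parameters:
DESCRIBE FORMATTED orc_table PARTITION (partition_name='<partition_name>');
DESCRIBE FORMATTED parquet_table PARTITION (partition_name='<partition_name>');
-- Or measure the files directly on HDFS:
-- hdfs dfs -du -s -h /user/hive/warehouse/orc_table/partition_name=<partition_name>
-- hdfs dfs -du -s -h /user/hive/warehouse/parquet_table/partition_name=<partition_name>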
Changing these settings solved the problem:
SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
Apparently, Hive uses MapReduce to convert between the data formats, so MapReduce output compression also needs to be switched on. But this is only a guess.
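If the job-level settings turn out to be fragile, another option (a sketch I have not verified on this exact setup) is to declare the codec on the ORC table itself via TBLPROPERTIES, so the compression does not depend on MapReduce settings; 'orc.compress' defaults to ZLIB anyway, but making it explicit rules it out as a variable:
-- Create the table as ORC with an explicit codec instead of
-- switching the file format per partition afterwards:
CREATE TABLE orc_table (
  field1 STRING
, field2 INT
, field3 STRING
, field4 STRING
, array_field ARRAY<INT>
) PARTITIONED BY (
  partition_name STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');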