Tags: hadoop · hive · parquet · orc

Why do orc files consume more space than parquet files in Hive?


As far as I understand, ORC files are supposed to compress better than Parquet files. However, when I populate my ORC table in Apache Hive by selecting rows from my Parquet table, the ORC table ends up consuming about 7 times more disk space.

Any ideas why this happens? My table schema is as follows. Each row contains an array of 200,000 integers (values 0, 1, and 2), and each partition holds about 10,000 rows.

CREATE TABLE orc_table (
  field1 STRING
, field2 INT
, field3 STRING
, field4 STRING
, array_field ARRAY < INT >
) PARTITIONED BY (
    partition_name STRING
);

ALTER TABLE orc_table ADD PARTITION (partition_name='<partition_name>');

ALTER TABLE orc_table PARTITION (partition_name='<partition_name>') SET FILEFORMAT ORC;

INSERT INTO TABLE orc_table PARTITION (partition_name='<partition_name>')
SELECT field1, field2, field3, field4, array_field FROM parquet_table
WHERE partition_name='<partition_name>';
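As an aside, a one-step alternative to the per-partition `ALTER ... SET FILEFORMAT` dance above is to declare the table as ORC at creation time. This is a hedged sketch, not the asker's setup: `STORED AS ORC` and the `orc.compress` table property are standard Hive features, but the ZLIB choice here is an assumption for illustration.

```sql
-- Sketch: declare the table as ORC with an explicit codec up front, so every
-- partition is written as compressed ORC without per-partition ALTER statements.
-- (ZLIB is ORC's default codec; shown explicitly here for clarity.)
CREATE TABLE orc_table (
  field1 STRING
, field2 INT
, field3 STRING
, field4 STRING
, array_field ARRAY < INT >
) PARTITIONED BY (
    partition_name STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');
```

With this DDL, the `INSERT ... SELECT` can be run directly and each new partition inherits the ORC format and codec.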

Solution

  • Changing these settings solved the problem:

    SET hive.exec.compress.intermediate=true;
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    

    Apparently, Hive uses map-reduce to convert between the data formats, so the map-reduce output compression also needs to be switched on. But this is only a guess.
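To verify the effect of these settings, the partition's on-disk footprint can be checked from Hive itself after re-running the `INSERT`. A small sketch (the partition name is a placeholder, as in the question):

```sql
-- After re-running the INSERT with compression enabled, inspect the partition;
-- its on-disk size is reported as totalSize under "Partition Parameters".
DESCRIBE FORMATTED orc_table PARTITION (partition_name='<partition_name>');
```

Comparing `totalSize` for the ORC and Parquet partitions makes it easy to confirm that the ORC output is now smaller, rather than eyeballing raw HDFS directory listings.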