Tags: hadoop, hive, compression, parquet, snappy

How can I insert into a Hive table with Parquet file format and SNAPPY compression?


Hive 2.1

I have the following table definition:

CREATE EXTERNAL TABLE table_snappy (
a STRING,
b INT) 
PARTITIONED BY (c STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');

Now, I would like to insert data into it:

INSERT INTO table_snappy PARTITION (c='something') VALUES ('xyz', 1);

However, when I look at the data file, all I see is a plain Parquet file without any compression. How can I enable SNAPPY compression in this case?

Goal: to have the Hive table data in Parquet format and SNAPPY compressed.

I have also tried setting multiple properties:

SET parquet.compression=SNAPPY;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET PARQUET_COMPRESSION_CODEC=snappy;

as well as

TBLPROPERTIES ('parquet.compression'='SNAPPY');

but nothing has helped. I tried the same with GZIP compression and it does not seem to work either. I am starting to wonder whether this is possible at all. Any help is appreciated.


Solution

  • One of the best ways to check whether the data is compressed is by using parquet-tools.

    create external table testparquet (id int, name string) 
      stored as parquet 
      location '/user/cloudera/testparquet/'
      tblproperties('parquet.compression'='SNAPPY');
    
    insert into testparquet values(1,'Parquet');
    

    Now, when you look at the file, its name may not contain .snappy anywhere:

    [cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/testparquet
    Found 1 items
    -rwxr-xr-x   1 anonymous supergroup        323 2018-03-02 01:07 /user/cloudera/testparquet/000000_0
    

    Let's inspect it further...

    [cloudera@quickstart ~]$ hdfs dfs -get /user/cloudera/testparquet/*
    [cloudera@quickstart ~]$ parquet-tools meta 000000_0 
    creator:     parquet-mr version 1.5.0-cdh5.12.0 (build ${buildNumber}) 
    
    file schema: hive_schema 
    -------------------------------------------------------------------------------------------------------------------------------------------------------------
    id:          OPTIONAL INT32 R:0 D:1
    name:        OPTIONAL BINARY O:UTF8 R:0 D:1
    
    row group 1: RC:1 TS:99 
    -------------------------------------------------------------------------------------------------------------------------------------------------------------
    id:           INT32 SNAPPY DO:0 FPO:4 SZ:45/43/0.96 VC:1 ENC:PLAIN,RLE,BIT_PACKED
    name:         BINARY SNAPPY DO:0 FPO:49 SZ:58/56/0.97 VC:1 ENC:PLAIN,RLE,BIT_PACKED
    [cloudera@quickstart ~]$ 
    

    It is indeed SNAPPY compressed, as the SNAPPY codec in the column chunk metadata shows.
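
    The same approach should carry over to the original table from the question; note that the Parquet property key is parquet.compression, whereas the DDL in the question uses parquet.compress. As a minimal, hedged sketch (reusing the table, column, and partition names from the question, and assuming ALTER TABLE ... SET TBLPROPERTIES is available in Hive 2.1):

    -- correct the table property on the existing table
    ALTER TABLE table_snappy SET TBLPROPERTIES ('parquet.compression'='SNAPPY');

    -- write new data; files created after this point should be SNAPPY compressed
    INSERT INTO table_snappy PARTITION (c='something') VALUES ('xyz', 1);

    The newly written files under the partition can then be verified with parquet-tools meta, exactly as shown above.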