Tags: hadoop, hive, snappy, orc

Hive Snappy Uncompressed length must be less


Running a self-join on the table below fails with the following exception:

java.lang.IllegalArgumentException: Uncompressed length 222258 must be less than 131072
        at org.iq80.snappy.SnappyInternalUtils.checkArgument(SnappyInternalUtils.java:116)
        at org.iq80.snappy.SnappyDecompressor.uncompress(SnappyDecompressor.java:72)
        at org.iq80.snappy.Snappy.uncompress(Snappy.java:43)
        at org.apache.hadoop.hive.ql.io.orc.SnappyCodec.decompress(SnappyCodec.java:71)
        at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:214)
        at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.available(InStream.java:251)

The problematic query is the following:

select a.*
from events a
inner join
(
  SELECT asset_id, time, max(hive_insert_ts)
  FROM events
  GROUP BY asset_id, time
) b on a.time = b.time
and a.asset_id = b.asset_id
limit 10;

The table is stored as ORC and compressed using SNAPPY:

create table events(
    asset_id varchar(15),
    time timestamp,
    hive_insert_ts timestamp)
PARTITIONED BY (
    country varchar(4),
    site varchar(4),
    year int,
    month int)
STORED as ORC
TBLPROPERTIES (
'orc.compress'='SNAPPY',
'orc.create.index'='true',
'orc.bloom.filter.columns'='asset_id, time',
'orc.bloom.filter.fpp'='0.05',
'orc.stripe.size'='268435456',
'orc.row.index.stride'='10000');

I searched a lot but could not find any hint. Do you have an idea where the problem could be?

Thanks a lot!


Solution

  • I found the solution (in case someone runs into the same problem). It was caused by a misconfiguration:

    The "orc.compress.size" table property defaults to

    'orc.compress.size'='262144', which is 256 KB,

    but "io.file.buffer.size" in core-site.xml is set to "131072", which is 128 KB.

    The stream reader works with a 131072-byte buffer, so a chunk that decompresses to more than that (222258 bytes in the exception above) does not fit into the file buffer.

    The solution is to either increase the file buffer size or decrease the compression size of the ORC table.

    I hope this helps someone else someday.
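
As a rough sketch of the second option (decreasing the compression size), assuming the events table from the question, the statement below lowers 'orc.compress.size' to match the 131072-byte buffer. The value 131072 is simply the io.file.buffer.size mentioned above. As far as I know, the property only affects files written after the change, so partitions that already exist would probably have to be rewritten before the query works on them.

```sql
-- Sketch only: lower the ORC compression chunk size to 128 KB so it
-- matches io.file.buffer.size (131072) from core-site.xml.
ALTER TABLE events SET TBLPROPERTIES ('orc.compress.size'='131072');

-- Check that the property is now set on the table.
SHOW TBLPROPERTIES events('orc.compress.size');
```

The other option, raising io.file.buffer.size to at least 262144 in core-site.xml, leaves the table untouched but changes the buffer size for everything that reads that setting.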