Tags: hadoop, hive, orc

Can someone explain the output of orcfiledump to me?


My table test_orc contains (for one partition):

col1 col2 part1
abc  def  1
ghi  jkl  1
mno  pqr  1
koi  hai  1
jo   pgl  1
hai  tre  1

By running

hive --orcfiledump /hive/user.db/test_orc/part1=1/000000_0

I get the following:

Structure for /hive/a0m01lf.db/test_orc/part1=1/000000_0
2018-02-18 22:10:24 INFO: org.apache.hadoop.hive.ql.io.orc.ReaderImpl - Reading ORC rows from /hive/a0m01lf.db/test_orc/part1=1/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 6
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 6
    Column 1: count: 6 min: abc max: mno sum: 17
    Column 2: count: 6 min: def max: tre sum: 18

File Statistics:
  Column 0: count: 6
  Column 1: count: 6 min: abc max: mno sum: 17
  Column 2: count: 6 min: def max: tre sum: 18

Stripes:
  Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67
    Stream: column 0 section ROW_INDEX start: 3 length 9
    Stream: column 1 section ROW_INDEX start: 12 length 29
    Stream: column 2 section ROW_INDEX start: 41 length 29
    Stream: column 1 section DATA start: 70 length 20
    Stream: column 1 section LENGTH start: 90 length 12
    Stream: column 2 section DATA start: 102 length 21
    Stream: column 2 section LENGTH start: 123 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2

What does the part about stripes mean?


Solution

  • First, let's see what an ORC file looks like.

    (Diagram: the ORC file layout, with stripes holding index data, row data, and a stripe footer, followed by the file footer and postscript.)

    Now, some keywords used in the above image and also in your question:

    • Stripe - A chunk of data stored in an ORC file. Every ORC file is divided into these chunks, called stripes; each is 250 MB by default and holds index data, the actual row data, and a stripe footer with metadata about that stripe's data.
    • Compression - The codec used to compress the stored data. ZLIB is the default for ORC.
    • Index Data - Includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. Note that ORC indexes are used only for selecting stripes and row groups, not for answering queries.

    • Row Data - The actual data; it is used in table scans.

    • Stripe Footer - The stripe footer contains the encoding of each column and the directory of the streams, including their locations. To describe each stream, ORC stores the kind of stream, the column id, and the stream's size in bytes. The details of what is stored in each stream depend on the type and encoding of the column.

    • Postscript - holds compression parameters and the size of the compressed footer.

    • File Footer - The file footer contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum. (The sketch after this list shows how to read this footer-level information programmatically.)
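
    To make these pieces concrete, here is a minimal sketch that reads the same footer-level information (row count, compression, schema, the stripe directory, and column statistics) with the Apache ORC Java API. It assumes the orc-core (org.apache.orc) and Hadoop client jars are on the classpath; on older Hive versions the equivalent reader classes live under org.apache.hadoop.hive.ql.io.orc instead.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.orc.ColumnStatistics;
    import org.apache.orc.OrcFile;
    import org.apache.orc.Reader;
    import org.apache.orc.StripeInformation;

    public class OrcInspect {
        public static void main(String[] args) throws Exception {
            // e.g. args[0] = "/hive/user.db/test_orc/part1=1/000000_0"
            Reader reader = OrcFile.createReader(
                    new Path(args[0]), OrcFile.readerOptions(new Configuration()));

            // General information from the file footer and postscript.
            System.out.println("Rows: " + reader.getNumberOfRows());
            System.out.println("Compression: " + reader.getCompressionKind());
            System.out.println("Type: " + reader.getSchema());

            // The stripe directory stored in the file footer.
            for (StripeInformation stripe : reader.getStripes()) {
                System.out.printf("Stripe: offset: %d data: %d rows: %d tail: %d index: %d%n",
                        stripe.getOffset(), stripe.getDataLength(), stripe.getNumberOfRows(),
                        stripe.getFooterLength(), stripe.getIndexLength());
            }

            // Column-level aggregates (count, min, max, sum).
            ColumnStatistics[] stats = reader.getStatistics();
            for (int i = 0; i < stats.length; i++) {
                System.out.println("Column " + i + ": " + stats[i]);
            }
        }
    }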

    Now, about your output from orcfiledump:

    1. First comes general information about your file: the name, location, compression codec, compression size, etc.
    2. Stripe statistics lists every stripe in your ORC file with its corresponding information: value counts plus type-dependent statistics such as min, max, and sum. For your string columns, min and max are lexicographic, and sum is the total number of characters stored (17 for column 1, 18 for column 2).
    3. File statistics is similar to #2, just aggregated over the complete file as opposed to each stripe in #2.
    4. The last part, the Stripes section, describes each stripe in your file: its byte layout (offset, index, data, and tail lengths), the streams stored for each column, and the encoding used per column. The breakdown below shows how those numbers fit together.
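
    Applying that to your dump, the stripe numbers reconcile exactly (all sizes in bytes):

    Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67
      index = 9 + 29 + 29       -> 67: the three ROW_INDEX streams
      data  = 20 + 12 + 21 + 5  -> 58: the DATA and LENGTH streams (string
                                   columns store values in DATA and the
                                   per-value lengths in LENGTH)
      DATA for column 1 starts at 70 = 3 (stripe offset) + 67 (index)
      tail  = 49                -> the stripe footer: the stream directory
                                   and the per-column encodings (DIRECT,
                                   DIRECT_V2)

    Column 0 is the top-level struct itself, which is why it has a ROW_INDEX stream but no DATA or LENGTH streams.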

    Also, you can use various options with orcfiledump to tailor the output. A handy guide follows.

    # Hive version 0.11 through 0.14:
    hive --orcfiledump <location-of-orc-file>

    # Hive version 1.1.0 and later:
    hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>

    # Hive version 1.2.0 and later:
    hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>

    # Hive version 1.3.0 and later:
    hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump]
        [--backup-path <new-path>] <location-of-orc-file-or-directory>

    A quick guide to the options used in the commands above (concrete example invocations follow the list):

    • Specifying -d in the command will cause it to dump the ORC file data rather than the metadata (Hive 1.1.0 and later).
    • Specifying --rowindex with a comma separated list of column ids will cause it to print row indexes for the specified columns, where 0 is the top level struct containing all of the columns and 1 is the first column id (Hive 1.1.0 and later).
    • Specifying -t in the command will print the timezone id of the writer.
    • Specifying -j in the command will print the ORC file metadata in JSON format. To pretty print the JSON metadata, add -p to the command.
    • Specifying --recover in the command will recover a corrupted ORC file generated by Hive streaming.
    • Specifying --skip-dump along with --recover will perform recovery without dumping metadata.
    • Specifying --backup-path with a new-path will let the recovery tool move corrupted files to the specified backup path (default: /tmp).
    • <location-of-orc-file> is the URI of the ORC file.
    • <location-of-orc-file-or-directory> is the URI of the ORC file or directory. From Hive 1.3.0 onward, this URI can be a directory containing ORC files.
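
    For example, against your file (pick the flags your Hive version supports):

    # Dump the file's data rather than its metadata (Hive 1.1.0 and later):
    hive --orcfiledump -d /hive/user.db/test_orc/part1=1/000000_0

    # Print the row indexes for columns 1 and 2 (Hive 1.1.0 and later):
    hive --orcfiledump --rowindex 1,2 /hive/user.db/test_orc/part1=1/000000_0

    # Pretty-printed JSON metadata (Hive 1.3.0 and later):
    hive --orcfiledump -j -p /hive/user.db/test_orc/part1=1/000000_0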

    Hope that helps!