
What is the difference between a block and a stripe?


From Hive's docs:

If the table or partition contains many small RCFiles or ORC files, then the above command will merge them into larger files. In case of RCFile the merge happens at block level whereas for ORC files the merge happens at stripe level thereby avoiding the overhead of decompressing and decoding the data.

My question is: What is the difference between a block and a stripe?


Solution

  • HDFS blocks are the lowest level and ORC stripes sit at a higher level. The two levels are completely independent: ORC stripes do not care about the underlying storage layer.

    HDFS blocks:

    • HDFS blocks are the lowest level and are independent of the file format. HDFS splits files into blocks to optimize storage.
    • One stripe can be stored across multiple blocks, and one block can contain multiple stripes or only part of a stripe. HDFS splits the file without considering the stripe layout or the file format.
    • HDFS stores the block metadata for each file. Writing and reading files is transparent to the ORC reader above it, because HDFS takes care of all the blocks.
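To make the "one stripe can span several blocks" point concrete, here is a minimal sketch in Python. It only does the offset arithmetic; the stripe offsets, stripe lengths, and the 128 MB block size are made-up values for illustration, and `blocks_for_stripe` is a hypothetical helper, not part of any HDFS or ORC API.

```python
MB = 1024 * 1024


def blocks_for_stripe(stripe_offset, stripe_length, block_size):
    """Return the indices of the HDFS blocks that hold this stripe's bytes.

    HDFS cuts a file at fixed byte offsets, so the block index of any byte
    is simply its offset divided by the block size.
    """
    first_block = stripe_offset // block_size
    last_block = (stripe_offset + stripe_length - 1) // block_size
    return list(range(first_block, last_block + 1))


# Assumed layout: three 64 MB stripes written back to back after the
# 3-byte "ORC" magic at the start of the file, on 128 MB blocks.
block_size = 128 * MB
stripes = [
    (3, 64 * MB),
    (3 + 64 * MB, 64 * MB),
    (3 + 128 * MB, 64 * MB),
]

for offset, length in stripes:
    print(offset, blocks_for_stripe(offset, length, block_size))
```

In this assumed layout the second stripe starts in block 0 and ends in block 1, while the first and third stripes each sit inside a single block. HDFS made that cut purely by byte offset, without looking at the stripe boundaries.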

    ORC stripes:

    • Stripes are the upper level of storage. A stripe knows nothing about blocks.

    • ORC is splittable at the stripe level. HDFS knows nothing about the ORC structure or how the file can be split for processing; it splits files into blocks only to optimize storage. The smallest unit a single container can process is one stripe. You can configure the stripe size to fit the block size.
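The idea of fitting stripes to blocks can be sketched with a simplified model of block padding, again in Python. This is not the ORC writer's actual algorithm (as far as I know, the real writer also applies a padding tolerance); `layout_stripes` and `count_crossing` are hypothetical helpers and the sizes are assumed values. In Hive the corresponding knobs are table properties such as `orc.stripe.size` and `orc.block.padding`.

```python
MB = 1024 * 1024


def layout_stripes(stripe_lengths, block_size, header=3, pad=True):
    """Return the start offset of each stripe.

    Simplified model: with pad=True, a stripe that does not fit in the
    space left in the current block is moved to the next block boundary.
    """
    offsets = []
    pos = header  # an ORC file begins with the 3-byte "ORC" magic
    for length in stripe_lengths:
        remaining = block_size - (pos % block_size)
        if pad and remaining < length <= block_size:
            pos += remaining  # padding bytes up to the block boundary
        offsets.append(pos)
        pos += length
    return offsets


def count_crossing(offsets, stripe_lengths, block_size):
    """Count the stripes whose bytes fall into more than one block."""
    return sum(
        1
        for off, length in zip(offsets, stripe_lengths)
        if off // block_size != (off + length - 1) // block_size
    )


block_size = 128 * MB
lengths = [64 * MB, 64 * MB, 64 * MB]

unpadded = layout_stripes(lengths, block_size, pad=False)
padded = layout_stripes(lengths, block_size, pad=True)

print(count_crossing(unpadded, lengths, block_size))
print(count_crossing(padded, lengths, block_size))
```

With these assumed sizes, one stripe crosses a block boundary in the unpadded layout and none do in the padded one. The trade-off is some wasted space at the end of a block in exchange for each stripe being readable from a single block.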

    Some useful links for further reading:

    • HDFS blocks
    • HDFS block vs Stripe
    • ORC optimizing
    • Big ORC stripes and block padding in S3 (a very useful blog post)