Search code examples
algorithmhadoopcompressionhdfsgzip

How gzip file gets stored in HDFS


HDFS storage support compression format to store compressed file. I know that gzip compression doesn't support splinting. Imagine now the file is a gzip-compressed file whose compressed size is 1 GB. Now my question is:

  1. How this file will get stored in HDFS (Block size is 64MB)

From this link I came to know that The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks.

But I couldn't understand it completely and looking for broad explanation.

More doubts from gzip compressed file:

  1. How many block will be there for this 1GB gzip compressed file.
  2. Will it go on multiple datanode ?
  3. How replication factor will be applicable for this file ( Hadoop cluster replication factor is 3.)
  4. What is DEFLATE algorithm?
  5. Which algorithm is applied while reading the gzip compressed file?

I am looking here broad and detailed explanation.


Solution

  • How this file will get stored in HDFS (Block size is 64MB) if splitting does not supported for zip file format?

    All DFS blocks will be stored in single Datanode. If your block size is 64 MB and file is 1 GB, the Datanode with 16 DFS blocks ( 1 GB / 64 MB = 15.625) will store 1 GB file.

    How many block will be there for this 1GB gzip compressed file.

    1 GB / 64 MB = 15.625 ~ 16 DFS blocks

    How replication factor will be applicable for this file ( Hadoop cluster replication factor is 3.)

    Same as of any other file. If the file is splittable, no change. If the file is not splittable, Datanodes with required number of blocks will be identified. In this case, 3 datanodes with 16 available DFS blocks.

    What is DEFLATE algorithm?

    DELATE is the algorithm to uncompress zipped files of GZIP format.