hadoop, gzip, mapper, bzip2

BZip2 file read in Hadoop


I have heard that Hadoop can use multiple mappers to read different parts of a single bzip2 file in parallel to improve performance, but I could not find any examples after searching. I would appreciate it if anyone could point me to a relevant code snippet. Thanks.

BTW: does gzip have the same feature (multiple mappers processing different parts of one gzip file in parallel)?


Solution

  • According to http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/30662, the bzip2 format is indeed splittable, so multiple mappers can work on one file. The patch was submitted at https://issues.apache.org/jira/browse/HADOOP-4012. However, it seems to be available only in Hadoop 0.21.0 and later.

    In my experience, you do not need to do anything special to use this with bzip2. Hadoop should pick it up automatically, depending on your minimum split size.

    bzip2 compresses data in blocks, so the blocks can be decompressed independently and different blocks handed to separate mappers. gzip, however, has no such block structure, so a single gzip file cannot be split across mappers.
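
To make the block structure concrete, here is a minimal Python sketch. It is not Hadoop code, only an illustration of why a reader can resynchronize in the middle of a bzip2 file. It relies on the fact that every bzip2 block begins with the 48-bit marker 0x314159265359 and that blocks are bit-aligned rather than byte-aligned. The helper names `make_sample` and `find_block_offsets` are made up for this example.

```python
import bz2
import random

# 48-bit marker that opens every bzip2 compressed block.
BLOCK_MAGIC = (0x314159265359).to_bytes(6, "big")


def make_sample(size, seed=0):
    """Return `size` pseudo-random bytes (hypothetical helper for the demo).

    Incompressible input keeps each block close to its nominal size.
    """
    rng = random.Random(seed)
    return rng.getrandbits(size * 8).to_bytes(size, "big")


def find_block_offsets(compressed):
    """Return the sorted bit offsets at which the block marker occurs.

    Blocks are bit-aligned, so the stream is shifted by 0..7 bits and
    each shifted copy is searched with an ordinary byte search.
    """
    n_bytes = len(compressed)
    mask = (1 << (n_bytes * 8)) - 1
    stream = int.from_bytes(compressed, "big")
    offsets = []
    for shift in range(8):
        # Drop the first `shift` bits; the tail is padded with zero bits.
        shifted = ((stream << shift) & mask).to_bytes(n_bytes, "big")
        pos = shifted.find(BLOCK_MAGIC)
        while pos != -1:
            offsets.append(pos * 8 + shift)
            pos = shifted.find(BLOCK_MAGIC, pos + 1)
    return sorted(offsets)


if __name__ == "__main__":
    data = make_sample(350_000)
    # compresslevel=1 selects the smallest block size (about 100 kB),
    # so this input is spread over several blocks.
    compressed = bz2.compress(data, compresslevel=1)
    for bit_offset in find_block_offsets(compressed):
        print("block marker at bit offset", bit_offset)
```

The first marker sits at bit offset 32, right after the 4-byte stream header; the later ones show where a reader could pick up decoding without having seen the start of the file. As far as I know, the DEFLATE stream inside a gzip file has no comparable marker, which matches the point above about gzip not being splittable.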