Tags: hadoop, hdfs, flume

What is the easiest way to combine small HDFS files?


I'm collecting logs with Flume into HDFS. For the test case I end up with small files (~300 kB), because the log-collecting process was scaled for real usage rather than for testing.

Is there an easy way to combine these small files into larger ones that are closer to the HDFS block size (64 MB)?


Solution

  • GNU coreutils' split can do the job.

    If the source data is line-oriented - in my case it is - and one line is around 84 bytes, then a 64 MB HDFS block can hold roughly 800,000 lines (67,108,864 B / 84 B ≈ 799,000):

    # concatenate all source files and split the stream into 800,000-line chunks
    hadoop dfs -cat /sourcedir/* | split --lines=800000 - joined_
    # upload the combined chunks back to HDFS
    hadoop dfs -copyFromLocal ./joined_* /destdir/
    

    or, with the --line-bytes option, which splits on line boundaries while capping each output file at 64 MB:

    # cap each output file at 67108864 bytes (64 MB) without breaking lines
    hadoop dfs -cat /sourcedir/* | split --line-bytes=67108864 - joined_
    hadoop dfs -copyFromLocal ./joined_* /destdir/
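
    For repeated use the two steps can be wrapped in a small shell script. This is only a minimal sketch under the assumptions above: the script name and the local scratch directory are placeholders, and /sourcedir, /destdir and the 64 MB cap are taken from the commands shown.

    #!/bin/sh
    # combine-small-files.sh (hypothetical name): merge small HDFS files
    # into ~64 MB chunks via a local temporary directory.
    set -e
    WORKDIR=$(mktemp -d)     # local scratch directory (assumption)
    cd "$WORKDIR"
    # concatenate everything under /sourcedir and split on line boundaries,
    # capping each local chunk at 64 MB
    hadoop dfs -cat /sourcedir/* | split --line-bytes=67108864 - joined_
    # upload the chunks back to HDFS, then remove the local copies
    hadoop dfs -copyFromLocal ./joined_* /destdir/
    cd / && rm -rf "$WORKDIR"

    Afterwards you can sanity-check the sizes of the new files with hadoop dfs -du /destdir before deleting the original small files.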