I'm collecting logs with Flume and writing them to HDFS. In my test case I end up with small files (~300 kB each), because the log-collecting process is scaled for real-world usage.
Is there any easy way to combine these small files into larger ones which are closer to the HDFS block size (64MB)?
GNU coreutils split can do the work. If the source data are lines (in my case they are) and one line is around 84 bytes, then a 64 MB HDFS block can hold roughly 800,000 lines (67,108,864 / 84 ≈ 798,900):
hadoop dfs -cat /sourcedir/* | split --lines=800000 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/
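
For repeated runs the two steps can be wrapped in a small shell script. This is only a sketch: /sourcedir, /destdir, the joined_ prefix and the 800,000-line chunk size are the placeholders from the commands above, and the local working directory needs enough free space to hold the merged data before it is copied back:

#!/bin/bash
# Merge small HDFS files into ~800,000-line chunks and re-upload them.
# SRC, DEST and LINES are assumptions taken from the example above.
set -e -o pipefail

SRC=/sourcedir
DEST=/destdir
LINES=800000

# Stream every source file and split the stream into fixed-size chunks locally
hadoop dfs -cat "$SRC/*" | split --lines=$LINES - joined_

# Upload the merged chunks and clean up the local copies
hadoop dfs -copyFromLocal ./joined_* "$DEST/"
rm -f ./joined_*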
or with the --line-bytes option:
hadoop dfs -cat /sourcedir/* | split --line-bytes=67108864 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/
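
Either way you can check the result with hadoop dfs -ls; each joined_ file in /destdir (the same placeholder directory as above) should now be close to 64 MB, with only the last chunk smaller:

hadoop dfs -ls /destdir/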