
How does Hadoop -getmerge work?


The Hadoop getmerge documentation gives the usage as:

Usage: hdfs dfs -getmerge src localdst [addnl]

My question is: why does getmerge concatenate to a local destination rather than to HDFS itself? I ask because of the following concerns:

  1. What if the files to be merged are larger than the available local storage?
  2. Is there a specific reason for restricting the hadoop -getmerge command to concatenating only to a local destination?

Solution

  • The getmerge command was created specifically for merging files from HDFS into a single file on the local file system.

    This command is very useful for downloading the output of a MapReduce job, which may have generated multiple part-* files, and combining them into a single local file that you can use for other operations (e.g., putting it into an Excel sheet for a presentation).

    Answers to your questions:

    1. If the destination file system does not have enough space, an IOException is thrown. Internally, getmerge uses the IOUtils.copyBytes() function to copy one file at a time from HDFS to the local file. This function throws an IOException whenever an error occurs during the copy operation.

    2. This command works along the same lines as hdfs dfs -get, which copies a file from HDFS to the local file system. The only difference is that hdfs dfs -getmerge merges multiple files from HDFS into a single file on the local file system.
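
    The one-file-at-a-time copy described in answer 1 can be sketched roughly as follows. This is not Hadoop's actual implementation: it uses only the JDK, with a local directory standing in for the HDFS source, and the class name GetMergeSketch, the method merge(), and the name-order sorting are assumptions made for illustration. The addNewline flag mimics the optional addnl argument, and any write failure (such as a full destination disk) surfaces as an IOException.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical stand-in for the getmerge flow: a local directory plays the role of HDFS.
class GetMergeSketch {

    // Appends every regular file in srcDir to dst, one file at a time, in name order.
    // Returns the number of bytes written. Any I/O failure propagates as IOException.
    static long merge(Path srcDir, Path dst, boolean addNewline) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (Stream<Path> listing = Files.list(srcDir)) {
            listing.filter(Files::isRegularFile).forEach(parts::add);
        }
        Collections.sort(parts); // e.g. part-00000, part-00001, ...

        long total = 0;
        try (OutputStream out = Files.newOutputStream(dst)) {
            for (Path part : parts) {
                // Streams the source file into the destination without loading it fully.
                total += Files.copy(part, out);
                if (addNewline) {
                    out.write('\n'); // separator after each file, like addnl
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("parts");
        Files.write(dir.resolve("part-00000"), "a,1".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("part-00001"), "b,2".getBytes(StandardCharsets.UTF_8));
        Path merged = Files.createTempFile("merged", ".csv");
        merge(dir, merged, true);
        System.out.print(new String(Files.readAllBytes(merged), StandardCharsets.UTF_8));
    }
}
```

    Because each source file is streamed straight into the destination, memory use stays small regardless of file size; the limiting resource is free space on the destination, which is why a shortage there shows up as an IOException partway through the copy.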

    If you want to merge multiple files within HDFS itself, you can do so with the copyMerge() method of the FileUtil class.

    This API copies all files in a directory into a single file, merging all the source files.