MapReduce job with HAR file output


I have many small input files. To run a MapReduce job over multiple input files, the command is:

hadoop jar <jarname> <packagename.classname> <input_dir> <output>

With this command, <output> is written as plain text. What command should I use so that all of the MapReduce job's output ends up in a HAR archive instead?


Solution

  • A MapReduce job like the one in your example can't write its output directly to a har file. Instead, run hadoop archive as a post-processing step after the job to pack its output into a har file:

    > hadoop jar */share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /README.txt /wordcountout
    
    > hdfs dfs -ls /wordcountout
    Found 2 items
    -rw-r--r--   3 chris supergroup          0 2015-12-16 11:28 /wordcountout/_SUCCESS
    -rw-r--r--   3 chris supergroup       1306 2015-12-16 11:28 /wordcountout/part-r-00000
    
    > hadoop archive -archiveName wordcountout.har -p /wordcountout /archiveout
    
    > hdfs dfs -ls har:///archiveout/wordcountout.har
    Found 2 items
    -rw-r--r--   3 chris supergroup          0 2015-12-16 12:17 har:///archiveout/wordcountout.har/_SUCCESS
    -rw-r--r--   3 chris supergroup       1306 2015-12-16 12:17 har:///archiveout/wordcountout.har/part-r-00000
    

    You may optionally delete the original contents (the /wordcountout directory in my example) if having the data in har format alone is sufficient for your needs.

    Additional information about the hadoop archive command is available here:

    http://hadoop.apache.org/docs/r2.7.1/hadoop-archives/HadoopArchives.html
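
The job-then-archive sequence above could be wrapped in a small shell function. This is only a sketch under stated assumptions, not part of the original answer: the names run and job_to_har and the DRY_RUN switch are hypothetical helpers, and the archive destination is assumed to be an absolute HDFS path so that har://<dest>/<name> resolves against the default file system, as in the listing above.

```shell
# Sketch: run a MapReduce job, then pack its output directory into a HAR.
# The helper names and the DRY_RUN switch are illustrative assumptions.

# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: job_to_har JAR CLASS INPUT_DIR OUTPUT_DIR ARCHIVE_NAME ARCHIVE_DIR
# ARCHIVE_DIR is assumed to be an absolute HDFS path (for example /archiveout).
job_to_har() {
  local jar="$1" class="$2" input="$3" output="$4" name="$5" dest="$6"

  # Step 1: the job writes ordinary files (part-r-*, _SUCCESS) to $output.
  run hadoop jar "$jar" "$class" "$input" "$output" || return 1

  # Step 2: archive everything under $output into $dest/$name.
  run hadoop archive -archiveName "$name" -p "$output" "$dest" || return 1

  # Step 3: list the archive contents through the har:// scheme.
  run hdfs dfs -ls "har://$dest/$name"
}
```

Calling it with DRY_RUN=1 prints the three commands without touching the cluster, which is a cheap way to check the paths first, for example: DRY_RUN=1 job_to_har my.jar pkg.Main /in /out out.har /archives. Each step returns early on failure, so the archive is not created from a failed job's partial output.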