Tags: c++, hadoop, mapreduce, chain

Chaining Hadoop MapReduce with Pipes (C++)


Does anyone know how to chain two MapReduce jobs with the Pipes API? I have already chained two MapReduce jobs in a previous project in Java, but today I need to use C++. Unfortunately, I haven't seen any examples in C++.

Has anyone already done it? Or is it impossible?


Solution

  • I finally managed to make Hadoop Pipes work. Here are some steps to get the wordcount examples available in src/examples/pipes/impl/ running.

    I have a working Hadoop 1.0.4 cluster, configured following the steps described in the documentation.

    To write a Pipes job I had to include the Pipes library, which ships precompiled in the initial package; it can be found in the c++ folder for both 32-bit and 64-bit architectures. However, I had to recompile it, which can be done with the following steps:

    # cd /src/c++/utils
    # ./configure
    # make install
    
    # cd /src/c++/pipes
    # ./configure
    # make install
    

    Those two commands compile the libraries for our architecture and create an 'install' directory in /src/c++ containing the compiled files.

    Moreover, I had to add the -lssl and -lcrypto link flags to compile my program. Without them I hit an authentication exception at runtime. Thanks to these steps I was able to run the wordcount-simple example found in the src/examples/pipes/impl/ directory.
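
    For reference, a minimal Pipes program has the same shape as the shipped wordcount-simple example: a Mapper and a Reducer derived from the HadoopPipes base classes, wired together through a TemplateFactory and launched with runTask. A sketch along those lines:

    #include <string>
    #include <vector>

    #include "hadoop/Pipes.hh"
    #include "hadoop/TemplateFactory.hh"
    #include "hadoop/StringUtils.hh"

    // Mapper: emit every word of the input line with a count of 1.
    class WordCountMap : public HadoopPipes::Mapper {
    public:
      WordCountMap(HadoopPipes::TaskContext& context) {}
      void map(HadoopPipes::MapContext& context) {
        std::vector<std::string> words =
          HadoopUtils::splitString(context.getInputValue(), " ");
        for (unsigned int i = 0; i < words.size(); ++i) {
          context.emit(words[i], "1");
        }
      }
    };

    // Reducer: sum the counts emitted for each word.
    class WordCountReduce : public HadoopPipes::Reducer {
    public:
      WordCountReduce(HadoopPipes::TaskContext& context) {}
      void reduce(HadoopPipes::ReduceContext& context) {
        int sum = 0;
        while (context.nextValue()) {
          sum += HadoopUtils::toInt(context.getInputValue());
        }
        context.emit(context.getInputKey(), HadoopUtils::toString(sum));
      }
    };

    int main(int argc, char *argv[]) {
      return HadoopPipes::runTask(
          HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>());
    }

    Assuming the install directory created above, it can be compiled with something like the following (adapt the file name and paths to your own layout):

    # g++ wordcount.cc -o wordcount \
        -I src/c++/install/include \
        -L src/c++/install/lib \
        -lhadooppipes -lhadooputils -lpthread -lssl -lcrypto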

    However, to run the more complex wordcount-nopipe example, I had to take a few extra steps. Due to the implementation of the record reader and record writer, we read from and write to the local file system directly. That's why we have to specify the input and output paths with file://. Moreover, we have to use a dedicated InputFormat component. Thus, to launch this job I had to use the following command:

    # bin/hadoop pipes -D hadoop.pipes.java.recordreader=false \
        -D hadoop.pipes.java.recordwriter=false \
        -libjars hadoop-1.0.4/build/hadoop-test-1.0.4.jar \
        -inputformat org.apache.hadoop.mapred.pipes.WordCountInputFormat \
        -input file:///input/file -output file:///tmp/output \
        -program wordcount-nopipe

    Furthermore, if we look at org.apache.hadoop.mapred.pipes.Submitter.java in version 1.0.4, the current implementation disables the ability to specify a non-Java record reader when the -inputformat option is used. You therefore have to comment out the line setIsJavaRecordReader(job, true); to make this possible, and recompile the core sources to take the change into account (http://web.archiveorange.com/archive/v/RNVYmvP08OiqufSh0cjR).

    if (results.hasOption("-inputformat")) {
        setIsJavaRecordReader(job, true); // comment out this line
        job.setInputFormat(getClass(results, "-inputformat", job, InputFormat.class));
    }
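
    As for the chaining itself, I am not aware of a Pipes-side equivalent of Java's ChainMapper or JobControl, so the straightforward approach is to submit the two jobs one after the other and feed the output directory of the first job to the second job as its input. A sketch, where the program names and paths are placeholders:

    # bin/hadoop pipes -D hadoop.pipes.java.recordreader=true \
        -D hadoop.pipes.java.recordwriter=true \
        -input /input -output /tmp/intermediate -program bin/job1

    # bin/hadoop pipes -D hadoop.pipes.java.recordreader=true \
        -D hadoop.pipes.java.recordwriter=true \
        -input /tmp/intermediate -output /tmp/final -program bin/job2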