I have just started learning Hadoop and am still experimenting and trying to understand things. I am really curious about the usage of the OutputCollector class's collect() method: all the examples I have found so far call this method only once. Is the cost of calling this method really high (since it writes the output to the file)? While thinking through different scenarios, I have run into a situation where I need to call it more than once, as in the code snippet below:
public static class Reduce extends MapReduceBase implements
        Reducer<IntWritable, Text, Text, NullWritable> {
    public void reduce(IntWritable key, Iterator<Text> values,
            OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        Text outData = null;
        while (values.hasNext()) {
            outData = new Text();
            outData.set(values.next().toString());
            output.collect(outData, NullWritable.get());
        }
    }
}
Here the values object contains a large number of records that the mapper has emitted based on some filtering condition, and I need to write those records to the output file. Alternatively, I could use the approach given below:
public static class Reduce extends MapReduceBase implements
        Reducer<IntWritable, Text, Text, NullWritable> {
    public void reduce(IntWritable key, Iterator<Text> values,
            OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        StringBuilder sb = new StringBuilder();
        while (values.hasNext()) {
            sb.append(values.next().toString() + "\r\n ");
        }
        Text outData = new Text();
        outData.set(sb.toString());
        output.collect(outData, NullWritable.get());
    }
}
Both approaches work fine on my single-node setup for a large input data set of up to 400k records, with the values object containing around 70k records. I want to ask: which approach is better? And will the code above behave well on a multi-node cluster? Any help is appreciated. Thanks.
In the end it boils down to how much data (in terms of size in bytes) you write.
Both solutions have some size overhead. In the first example you write multiple strings, so you pay the constant overhead of serializing the length of each string; in the second you pay the same kind of overhead through your line separators. Concretely, for 70k values the first approach writes one small varint length prefix per record, while the second writes three separator bytes ("\r\n ") per record.
So in terms of bytes written the two are roughly equal, and collecting the data should not be significantly slower with either solution.
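If you want to see that length-prefix overhead for yourself, here is a minimal sketch (assuming Hadoop's Text and DataOutputBuffer are on the classpath; the class name is just for illustration) that serializes a single Text the same way the framework does:

import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Text;

public class TextOverheadDemo {
    public static void main(String[] args) throws Exception {
        DataOutputBuffer buf = new DataOutputBuffer();
        Text t = new Text("record");          // 6 bytes of UTF-8 payload
        t.write(buf);                         // writes a vint length prefix, then the bytes
        System.out.println(buf.getLength());  // prints 7: 1 prefix byte + 6 payload bytes
    }
}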
A very different part of your problem is memory usage. Think of a very large iteration over the values: your StringBuilder will be inefficient because of its resize operations and all the memory it uses. The collect method is smarter and spills to disk when the write buffer fills up. On the other hand, if you have plenty of available memory and you want to write a single huge record in one go, this might be just as efficient as setting the write buffer to a similar size.