I would like to use Hadoop streaming with Perl scripts as the mapper and reducer. I found this explanation, which partially answers my question, but it does not cover a reducer that handles all values for each key together.
For example, the mapper might extract (product, category) pairs, and the reducer would output the list of categories for each product. This is of course possible by keeping all of the reducer's data in memory (as in the example I mentioned), but in many cases that does not scale. Is there a way to have the Perl script receive all values for each key at once (as in a normal map-reduce job)?
You can use the CPAN library Hadoop::Streaming. Its reduce method receives each key together with an iterator over that key's values:
sub reduce
{
    my ( $self, $key, $value_iterator ) = @_;
    ...
    # Walk the values for this key one at a time.
    while ( $value_iterator->has_next() ) { ... }
    # Emit a single record for the key.
    $self->emit( $key, $composite_value );
}
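
If you would rather not add a dependency, the same per-key grouping can be sketched in plain Perl. Hadoop streaming hands the reducer its input sorted by key, one tab-separated key/value pair per line, so all lines for a key arrive consecutively and you only need to buffer the values of the current key. This is a minimal sketch, not how the library is implemented; the name reduce_sorted, the sample data, and the comma-joined output format are my own assumptions for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: read "key<TAB>value" lines that are already sorted
# by key and call $emit->($key, \@values) once per key. Only the values of
# the current key are held in memory.
sub reduce_sorted {
    my ( $in, $emit ) = @_;
    my ( $current, @values );
    while ( my $line = <$in> ) {
        chomp $line;
        next unless length $line;              # skip blank lines
        my ( $key, $value ) = split /\t/, $line, 2;
        $value = '' unless defined $value;     # line had a key but no value
        if ( defined $current && $key ne $current ) {
            # The key changed, so the previous key is complete.
            $emit->( $current, [@values] );
            @values = ();
        }
        $current = $key;
        push @values, $value;
    }
    # Flush the last key, if there was any input at all.
    $emit->( $current, [@values] ) if defined $current;
}

# Demo on in-memory sample data (made up for this example).
my $sample = "apple\tfruit\napple\tfood\nhammer\ttool\n";
open my $in, '<', \$sample or die "cannot open in-memory handle: $!";
reduce_sorted( $in, sub {
    my ( $product, $categories ) = @_;
    print join( "\t", $product, join( ',', @$categories ) ), "\n";
} );
close $in;
```

In a real streaming job you would pass \*STDIN instead of the in-memory handle. The sketch still buffers the values of one key at a time; if a single key can have a very large number of values, process each value inside the loop instead of pushing it onto @values.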