I have two MapReduce jobs; the output of the first reducer is the input of the second mapper:
Map1 -> Reduce1 -> Map2 -> Reduce2
For now, Map2 reads from the files written by Reduce1, so Map1 -> Reduce1 and Map2 -> Reduce2 are two independent jobs.
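Roughly, the driver chains the two jobs like this (a simplified sketch; the key/value types and paths are placeholders, and the intermediate data is assumed to go through SequenceFiles so the types are preserved between the jobs):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);  // Reduce1 writes here, Map2 reads from here
        Path output = new Path(args[2]);

        // Job 1: Map1 -> Reduce1, output stored as SequenceFiles on HDFS
        Job job1 = Job.getInstance(conf, "job1");
        job1.setJarByClass(ChainedJobsDriver.class);
        job1.setMapperClass(Map1.class);            // my mapper/reducer classes (not shown)
        job1.setReducerClass(Reduce1.class);
        job1.setOutputKeyClass(Text.class);         // placeholder types
        job1.setOutputValueClass(Text.class);
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        job1.waitForCompletion(true);

        // Job 2: Map2 -> Reduce2, reading the files Reduce1 just wrote
        Job job2 = Job.getInstance(conf, "job2");
        job2.setJarByClass(ChainedJobsDriver.class);
        job2.setMapperClass(Map2.class);
        job2.setReducerClass(Reduce2.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```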
It works, but it would be simpler, and I think more efficient, if the output of Reduce1 were directly the input of Map2.
Is there a way to do that? In this case Map2 would just be an identity mapper, so it would be even better if I could do:
Map1 -> Reduce1 -> Reduce2
Reduce1, Map2 and Reduce2 all have the same input and output types.
Thanks!
Based on my understanding, here are a few points (they may or may not help; correct me if I am wrong):
1) Map1 -> Reduce1 -> directly into Map2: this kind of optimization is what the Spark cluster-computing framework addresses, by keeping intermediate data in memory and avoiding unnecessary reads/writes to HDFS (see the Spark sketch after this list).
2) If you want something like Reduce1 -> Reduce2, you have to think about whether you can write the logic in a single reducer. Whether that works depends on your requirement, i.e. which keys you want to aggregate on: Reduce1 receives all the values grouped under one key, and it is only on that same key grouping that you can perform the next aggregation.
3) In Hadoop the contract is always map --> then aggregation; if you need another aggregation, it has to be fed by another mapper, either a user-defined one or an identity mapper (see the sketch after this list).
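For point 1, here is a minimal sketch of what such a pipeline looks like with Spark's Java API, where the two aggregations chain in memory and nothing has to be written back to HDFS in between (the parsing and the re-keying logic are placeholders, since I don't know your actual keys and values):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ChainedAggregations {
    public static void main(String[] args) {
        JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("chained-aggregations"));

        // "Map1": parse each input line into a (key, value) pair (placeholder parsing)
        JavaPairRDD<String, Long> pairs = sc.textFile(args[0])
                .mapToPair(line -> {
                    String[] parts = line.split("\t");
                    return new Tuple2<>(parts[0], Long.parseLong(parts[1]));
                });

        // "Reduce1": first aggregation by key, kept in memory
        JavaPairRDD<String, Long> firstAgg = pairs.reduceByKey(Long::sum);

        // "Reduce2": second aggregation chained directly onto the first one;
        // here records are re-keyed by a prefix before summing again
        // (placeholder logic), with no identity mapper and no intermediate files.
        JavaPairRDD<String, Long> secondAgg = firstAgg
                .mapToPair(kv -> new Tuple2<>(kv._1().substring(0, 1), kv._2()))
                .reduceByKey(Long::sum);

        secondAgg.saveAsTextFile(args[1]);
        sc.stop();
    }
}
```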
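For point 3, if the goal is effectively Map1 -> Reduce1 -> Reduce2, the second job can simply use Hadoop's built-in Mapper as the identity mapper (in the new org.apache.hadoop.mapreduce API the base Mapper just passes every record through), so no Map2 class has to be written. A rough sketch of that second job's driver, with placeholder types:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondJobDriver {
    public static void main(String[] args) throws Exception {
        Job job2 = Job.getInstance(new Configuration(), "reduce2-only");
        job2.setJarByClass(SecondJobDriver.class);

        // The base Mapper class forwards every (key, value) pair unchanged,
        // so it acts as the identity mapper between Reduce1 and Reduce2.
        job2.setMapperClass(Mapper.class);
        job2.setReducerClass(Reduce2.class);        // Reduce2 from the question (not shown)

        // Assumes job 1 wrote its output as SequenceFiles so the key/value
        // types are preserved between the two jobs.
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        job2.setOutputKeyClass(Text.class);         // placeholder types
        job2.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job2, new Path(args[0]));   // Reduce1's output dir
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that this still runs as a separate job: Reduce1's output is still written to HDFS and shuffled again before Reduce2. Only the hand-written Map2 disappears.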
hope this helps :)