hadoop, mapreduce, distributed-computing, sdn

What will happen in Hadoop if the input of one system depends on the output of some other system?


If, in a Hadoop setup, the input of one system depends on the output of some other system, then parallel computation is not achieved.

Is there any way to solve this problem? Please provide a detailed solution or links to resources.


Solution

  • The question is a bit vague, but fortunately there is a generic answer.

    If you cannot do everything in one map-reduce stage, for instance because of dependencies, you can split the work into multiple stages chained one after another (see the driver sketch after this list).

    A simple example would be:

    map-reduce-map-reduce


    Of course this has limitations: if all processing of line 2 depends on the final result of processing line 1, then it is fundamentally impossible to process line 1 and line 2 in parallel.
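
Below is a minimal sketch of what chaining two MapReduce jobs can look like in a single driver class, where the second job's input directory is the first job's output directory. The paths, job names, and the use of Hadoop's identity Mapper and Reducer classes are placeholders; a real pipeline would plug in its own mapper and reducer implementations and key/value types.

```java
// Sketch of a two-stage MapReduce chain: stage 2 consumes stage 1's output.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);          // original input
        Path intermediate = new Path(args[1]);   // stage 1 output, stage 2 input
        Path output = new Path(args[2]);         // final output

        // Stage 1: first map-reduce pass (identity Mapper/Reducer used as
        // placeholders; substitute your own classes here).
        Job job1 = Job.getInstance(conf, "stage-1");
        job1.setJarByClass(ChainedJobsDriver.class);
        job1.setMapperClass(Mapper.class);
        job1.setReducerClass(Reducer.class);
        job1.setOutputKeyClass(LongWritable.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);

        // Block until stage 1 finishes: stage 2 cannot start before its
        // input data exists, which is exactly the dependency in question.
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Stage 2: second map-reduce pass over stage 1's output.
        Job job2 = Job.getInstance(conf, "stage-2");
        job2.setJarByClass(ChainedJobsDriver.class);
        job2.setMapperClass(Mapper.class);
        job2.setReducerClass(Reducer.class);
        job2.setOutputKeyClass(LongWritable.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);

        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```

Within each stage, maps and reduces still run in parallel across the cluster; only the boundary between the stages is sequential, which is the trade-off described above.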