Tags: java, hadoop, recursion, mapreduce

MapReduce with Recursion


Consider the following problem:

EDIT: Don't worry if the algorithm below doesn't make much sense; I just put it there for the sake of illustration. The idea is that doFunc is somehow recursive.

doFunc(A):
    [a0, a1, a2, ...] <- A
    If (someCondition([a0, a1, a2, ...]) == False)
        A <- modified(A)
        r <- doFunc(A)
        A <- convertR(r)
    B <- someFunc1(A)
    C <- someFunc2(B)
    r <- lastFunc(C)
    return r

In this case, r is the result of the recursive function doFunc: whenever someCondition on the list [a0, a1, a2, ...] is false, the function recurses to obtain some kind of optimal A for which the condition is true.

Now consider that MapReduce could individually be applied to different parts of the program: say, for example, converting A to [a0, a1, a2, ...], computing modified(A), and evaluating each someFuncI are all possible using MapReduce. How does the recursion fit into this MapReduce implementation?

Given this, plain Hadoop Streaming seems out of the question, since I don't understand how to implement it with recursion. The only other possibility is to use a Python Hadoop Streaming wrapper such as dumbo or mrjob and write the code as if the recursion weren't there, even though it will obviously unfold when doFunc calls itself. I am wondering how that factors into MapReduce and what the scalability is like.
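
For concreteness, one someFuncI pass written with mrjob might look roughly like the sketch below; the mapper/reducer bodies are just stand-ins, and the recursion itself would have to be driven by an outer Python loop that re-runs jobs like this one.

    # step.py -- one MapReduce pass (a single someFuncI), written with mrjob.
    # The mapper/reducer bodies are stand-ins for the real logic.
    from mrjob.job import MRJob

    class MRSomeFuncStep(MRJob):

        def mapper(self, _, line):
            # treat each whitespace-separated token as one element a_i of A
            for a_i in line.split():
                yield "A", a_i

        def reducer(self, key, values):
            # collect the elements back together -- stand-in for someFunc1/someFunc2
            yield key, sorted(values)

    if __name__ == "__main__":
        MRSomeFuncStep.run()

Each such step is an ordinary, non-recursive job (run locally with python step.py input.txt, or on a cluster with -r hadoop); the open question is what drives the repeated invocations.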

Questions: I have asked the questions in the text above, but they might not be clear enough, so I'll lay them out explicitly here.

  1. Does MapReduce behave well with recursion?
  2. If so, does it scale well?
  3. Is there a way to implement Hadoop Streaming with functions involving recursion?

Solution

  • The only form of recursion which can be implemented in Hadoop is tail recursion, which means that the recursive call must come at the end of the current call. Strictly speaking, recursion can't be emulated in Hadoop at all, because the framework can't save the state of the current job while the next one (the recursive call) executes, then reload the current job and resume its execution. However, tail recursion can be simulated by chaining jobs, i.e., starting the next one(s) when the current one ends (see the driver sketch after this list).

    I have successfully chained tens or hundreds of jobs, so there is no particular problem with fusing a few (probably even thousands of) jobs into a sequence. However, there is a performance penalty associated with this practice, for three main reasons: setting up and tearing down jobs takes time; jobs might fail and need to be restarted; and a job might land on slower machines, which delays its completion.

    But, apart from these details, what I think you should do is make sure Hadoop is what you need. Hadoop is a fairly specialized framework in the sense that it addresses tasks which are "data parallelizable", i.e. tasks which work on (usually) big data and which can be applied either to the entire data set at once or repeatedly to small chunks of that data, achieving in the end the same result as when applied to the whole data set. What you describe doesn't seem to fall into this category.
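
To make the chaining concrete, here is a minimal driver sketch in Python that simulates the tail-recursive call by launching one Hadoop Streaming job per "recursion level" and feeding each job's output directory in as the next job's input. The streaming jar path, the step_mapper.py/step_reducer.py scripts, the HDFS paths, and the convergence test are all hypothetical stand-ins for the real doFunc pipeline.

    # driver.py -- simulates tail recursion by chaining Hadoop Streaming jobs:
    # each iteration's output directory becomes the next iteration's input.
    # STREAMING_JAR, step_mapper.py/step_reducer.py, the /data paths and the
    # convergence test are hypothetical stand-ins for the real pipeline.
    import subprocess

    STREAMING_JAR = "/usr/lib/hadoop/hadoop-streaming.jar"  # adjust to your install

    def run_step(input_path, output_path):
        # launch one MapReduce pass (one "recursion level") and wait for it
        subprocess.check_call([
            "hadoop", "jar", STREAMING_JAR,
            "-files", "step_mapper.py,step_reducer.py",
            "-mapper", "python step_mapper.py",
            "-reducer", "python step_reducer.py",
            "-input", input_path,
            "-output", output_path,
        ])

    def some_condition(output_path):
        # read the step's output back and decide whether to stop "recursing"
        out = subprocess.check_output(
            ["hadoop", "fs", "-cat", output_path + "/part-*"])
        return b"CONVERGED" in out  # placeholder convergence test

    current = "/data/A"        # the initial A
    for depth in range(100):   # cap the recursion depth defensively
        next_out = "/data/step-%d" % depth
        run_step(current, next_out)
        current = next_out     # the "tail call": recurse on the modified A
        if some_condition(next_out):
            break

Note that every pass pays the full job setup/teardown cost mentioned above, which is where most of the performance penalty of this pattern comes from.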