Tags: apache-spark, parallel-processing, mapreduce, spark-streaming, distributed-computing

DB access from a Mapper in MapReduce


I'm planning the next generation of an analysis system I'm developing, and I'm considering implementing it on one of the MapReduce/stream-processing platforms such as Flink or Spark Streaming.

For the analysis, the mappers must have DB access.

So my greatest concern is that when the mappers run in parallel, all the connections in the connection pool will be in use and some mappers may fail to access the DB.

How should I handle that? Is it something I need to be concerned about?


Solution

  • As you have pointed out, a pull-style strategy is going to be inefficient and/or complex.

    Your strategy for ingesting the meta-data from the DB will be dictated by how much meta-data there is and how frequently it changes. Either way, moving away from fetching the meta-data on demand, and toward receiving updates when it changes, is likely to be a good approach.

    Some ideas:

    • Periodically dump the meta-data to flat files on a distributed file system (see the Spark broadcast sketch below)
    • Stream meta-data updates to your pipeline at write time to keep an in-memory cache up to date
    • Use a separate mechanism to fetch the meta-data, for instance Akka actors polling for changes

    It will depend on the trade-offs you are able to make for your use case.
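
    For illustration, here is a minimal Spark (Scala) sketch of the first idea: the meta-data is periodically dumped to a flat file and broadcast to the executors, so mappers read it from local memory instead of hitting the DB for every record. The HDFS paths and the "key,value" line format are assumptions for the example, not part of your setup.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    // Load the latest meta-data dump into a plain Map on the driver.
    // The path and the "key,value" line format are placeholders.
    def loadMetadata(sc: SparkContext, path: String): Map[String, String] =
      sc.textFile(path)
        .map { line =>
          val Array(k, v) = line.split(",", 2)
          k -> v
        }
        .collect()
        .toMap

    val sc = new SparkContext(new SparkConf().setAppName("metadata-broadcast"))

    // Broadcast once; every executor gets a read-only copy of the meta-data.
    val metadata = sc.broadcast(loadMetadata(sc, "hdfs:///metadata/latest.csv"))

    val enriched = sc.textFile("hdfs:///events/").map { event =>
      // Pure in-memory lookup on the executor; no DB round-trip per record.
      val key = event.split(",")(0)
      (event, metadata.value.getOrElse(key, "unknown"))
    }
    ```

    When the dump is refreshed, the map can be reloaded and re-broadcast on whatever schedule you choose, so freshness is bounded by the dump interval.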

    If DB interactivity is unavoidable, I do wonder whether MapReduce-style frameworks are the best approach to your problem. In any case, failed tasks should be retried by the framework, so a mapper that occasionally fails to get a connection is not necessarily fatal.
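
    If per-record DB lookups really cannot be avoided, one common mitigation (sketched below for Spark in Scala) is to open a single connection per partition rather than per record, so the number of concurrent connections is bounded by the number of simultaneously running tasks rather than by data volume. The JDBC URL, credentials, and query here are placeholders.

    ```scala
    import java.sql.DriverManager
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("db-per-partition"))

    val enriched = sc.textFile("hdfs:///events/").mapPartitions { events =>
      // One connection per partition (i.e. per task), so concurrent connections
      // are capped by the number of tasks running at once.
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://db-host/analytics", "user", "password")
      val stmt = conn.prepareStatement("SELECT label FROM metadata WHERE key = ?")

      val result = events.map { event =>
        val key = event.split(",")(0)
        stmt.setString(1, key)
        val rs = stmt.executeQuery()
        val label = if (rs.next()) rs.getString(1) else "unknown"
        rs.close()
        (event, label)
      }.toList // materialise before closing the connection

      stmt.close()
      conn.close()
      result.iterator
    }
    ```

    A task that cannot obtain a connection will fail and, as noted above, be retried by the framework; sizing the number of concurrent tasks against the DB's connection limit keeps such retries rare.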