Tags: apache-spark, parallel-processing, mapreduce, spark-streaming, distributed-computing

DB access from a Mapper in MapReduce


I'm planning the next generation of an analysis system I'm developing, and I'm considering implementing it on one of the MapReduce/stream-processing platforms such as Flink or Spark Streaming.

For the analysis, the mappers must have DB access.

So my greatest concern is that when the mappers run in parallel, all the connections in the connection pool will be in use and some mappers may fail to access the DB.

How should I handle that? Is it something I need to be concerned about?


Solution

  • As you have pointed out, a pull-style strategy is going to be inefficient and/or complex.

    Your strategy for ingesting the meta-data from the DB will be dictated by how much meta-data there is and how frequently it changes. Either way, moving away from fetching the meta-data on demand, and toward receiving updates when it changes, is likely to be a good approach.

    Some ideas:

    • Periodically dump the meta-data to flat files on a distributed file system (see the Spark broadcast sketch below)
    • Stream meta-data updates to your pipeline at write time to keep an in-memory cache up to date
    • Use a separate mechanism to fetch the meta-data, for instance Akka actors polling for changes

    It will depend on the trade-offs you are able to make for your use case.
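
    For illustration, here is a minimal Spark (Scala) sketch of the first idea: the meta-data is periodically dumped to a flat file and broadcast to the executors, so mappers read it from local memory instead of hitting the DB for every record. The HDFS paths and the "key,value" line format are assumptions for the example, not part of your setup.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    // Load the latest meta-data dump into a plain Map on the driver.
    // The path and the "key,value" line format are placeholders.
    def loadMetadata(sc: SparkContext, path: String): Map[String, String] =
      sc.textFile(path)
        .map { line =>
          val Array(k, v) = line.split(",", 2)
          k -> v
        }
        .collect()
        .toMap

    val sc = new SparkContext(new SparkConf().setAppName("metadata-broadcast"))

    // Broadcast once; every executor gets a read-only copy of the meta-data.
    val metadata = sc.broadcast(loadMetadata(sc, "hdfs:///metadata/latest.csv"))

    val enriched = sc.textFile("hdfs:///events/").map { event =>
      // Pure in-memory lookup on the executor; no DB round-trip per record.
      val key = event.split(",")(0)
      (event, metadata.value.getOrElse(key, "unknown"))
    }
    ```

    When the dump is refreshed, the map can be reloaded and re-broadcast on whatever schedule you choose, so freshness is bounded by the dump interval.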

    If DB interactivity is unavoidable, I do wonder whether MapReduce-style frameworks are the best approach to your problem. In any case, failed tasks should be retried by the framework, so a mapper that occasionally fails to get a connection is not necessarily fatal.
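
    If per-record DB lookups really cannot be avoided, one common mitigation (sketched below for Spark in Scala) is to open a single connection per partition rather than per record, so the number of concurrent connections is bounded by the number of simultaneously running tasks rather than by data volume. The JDBC URL, credentials, and query here are placeholders.

    ```scala
    import java.sql.DriverManager
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("db-per-partition"))

    val enriched = sc.textFile("hdfs:///events/").mapPartitions { events =>
      // One connection per partition (i.e. per task), so concurrent connections
      // are capped by the number of tasks running at once.
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://db-host/analytics", "user", "password")
      val stmt = conn.prepareStatement("SELECT label FROM metadata WHERE key = ?")

      val result = events.map { event =>
        val key = event.split(",")(0)
        stmt.setString(1, key)
        val rs = stmt.executeQuery()
        val label = if (rs.next()) rs.getString(1) else "unknown"
        rs.close()
        (event, label)
      }.toList // materialise before closing the connection

      stmt.close()
      conn.close()
      result.iterator
    }
    ```

    A task that cannot obtain a connection will fail and, as noted above, be retried by the framework; sizing the number of concurrent tasks against the DB's connection limit keeps such retries rare.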