I'm planning the next generation of an analysis system I'm developing, and I'm thinking of implementing it using one of the MapReduce/stream-processing platforms like Flink, Spark Streaming, etc.
For the analysis, the mappers must have DB access.
So my greatest concern is that when a mapper is parallelized, all the connections in the connection pool will be in use, and some mapper instance might fail to access the DB.
How should I handle that? Is it something I need to be concerned about?
As you have pointed out, a pull-style strategy is going to be inefficient and/or complex.
Your strategy for ingesting the meta-data from the DB will be dictated by the amount of meta-data and how frequently it changes. Either way, moving away from fetching the meta-data when it's needed, and toward receiving updates when the meta-data changes, is likely to be a good approach.
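To make the push-style idea concrete, here is a minimal sketch in plain Python. It does not use Flink or Spark APIs. The names `MetadataCache` and `enrich`, the `("meta" | "data", key, payload)` event shape, and the "`None` means delete" convention are all assumptions made up for illustration. The point is only that each mapper keeps a local copy of the meta-data and applies changes as they arrive, so the per-record path never touches the DB.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, Tuple


@dataclass
class MetadataCache:
    """Local copy of the DB meta-data, kept current by applying change events."""
    _data: Dict[str, Any] = field(default_factory=dict)

    def apply_update(self, key: str, value: Any) -> None:
        # Assumption of this sketch: a value of None means "row deleted".
        if value is None:
            self._data.pop(key, None)
        else:
            self._data[key] = value

    def lookup(self, key: str, default: Any = None) -> Any:
        # Pure in-memory read: no DB connection is needed per record.
        return self._data.get(key, default)


def enrich(
    events: Iterable[Tuple[str, str, Any]], cache: MetadataCache
) -> Iterator[Tuple[str, Any, Any]]:
    """Consume a merged stream of meta-data changes and data records.

    "meta" events update the cache; "data" events are emitted together with
    whatever meta-data is known for their key at that moment.
    """
    for kind, key, payload in events:
        if kind == "meta":
            cache.apply_update(key, payload)
        else:
            yield key, payload, cache.lookup(key)
```

In a real deployment the "meta" events would come from whatever change feed your DB or messaging layer offers. How that feed is produced is outside this sketch.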
Some ideas:
It will depend on the trade-offs you are able to make for your given use-case.
If DB interactivity is unavoidable, I do wonder if map-reduce style frameworks would be the best approach to solve your problem. But any failed tasks should be retried by the framework.
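If you do keep DB access inside the mappers, one way to address the pool-exhaustion worry is to have each mapper wait for a free connection rather than fail immediately, and only raise after a timeout so that the framework's task retry can take over. The following is a hedged sketch using only the Python standard library. `BoundedPool` is a hypothetical helper, not part of any framework or DB driver, and the "connections" can be any objects your driver gives you.

```python
import queue
from contextlib import contextmanager


class BoundedPool:
    """Fixed-size pool: callers block until a connection is free."""

    def __init__(self, connections):
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)

    @contextmanager
    def connection(self, timeout=5.0):
        try:
            # Block for up to `timeout` seconds waiting for a free connection.
            conn = self._idle.get(timeout=timeout)
        except queue.Empty:
            # Surface a clear error; the task fails and can be retried.
            raise TimeoutError("no DB connection became free in time") from None
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)
```

A mapper would then wrap each DB call in `with pool.connection() as conn: ...`. Sizing the pool at or above the mapper parallelism per process avoids waiting altogether; the timeout is only a safety net.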