My question is about knowing a good choice for parallelism for operators in a flink job in a fixed cluster setting. Suppose, we have a flink job DAG containing map
and reduce
type operators with pipelined edges between them (no blocking edge). An example DAG is as follows:
Scan -> Keyword Search -> Aggregation
Assume a fixed size cluster of M
machines with C
cores each and the DAG is the only workflow to be run on the cluster. Flink allows the user to set the parallelism for individual operators. I usually set M*C
parallelism for each operator. But is this the best choice from performance perspective (e.g. execution time)? Can we leverage the properties of the operators to make a better choice? For example, if we know that aggregation
is more expensive, should we assign M*C
parallelism to only the aggregation
operator and reduce the parallelism for other operators? This hopefully will reduce the chances of backpressure too.
I am not looking for a proper formula that will give me the "best" parallelism. I am just looking for some kind of an intuition/guideline/ideas that can be used to make a decision. Surprisingly, I could not find much literature to read on this topic.
Note: I am aware of the dynamic scaling reactive mode in recent Flink. But my question is about a fixed cluster with only one workflow running, which means that the dynamic scaling is not relevant. I looked at this question, but did not get an answer.
I think about this a little differently. From my perspective, there are two key questions to consider:
(1) Do I want to keep the slots uniform? Or in other words, will each slot have an instance of every task, or do I want to adjust the parallelism of specific tasks?
(2) How many cores per slot?
My answer to (1) defaults to "keep things uniform". I haven't seen very many situations where tuning the parallelism of individual operators (or tasks) has proven to be worthwhile.
Changing the parallelism is usually counterproductive if it means breaking an operator chain. Doing it where's a shuffle anyway can make sense in unusual circumstances, but in general I don't see the point. Since some of the slots will have instances of every operator, and the slots are all uniform, why is it going to be helpful to have some slots with fewer tasks assigned to them? (Here I'm assuming you aren't interested in going to the trouble of setting up slot sharing groups, which of course one could do.) Going down this path can make things more complex from an operational perspective, and for little gain. Better, in my opinion, to optimize elsewhere (e.g., serialization).
As for cores per slot, many jobs benefit from having 2 cores per slot, and for some complex jobs with lots of tasks you'll want to go even higher. So I think in terms of an overall parallelism of M*C
for simple ETL jobs, and M*C/2
(or lower) for jobs doing something more intense.
To illustrate the extremes:
A simple ETL job might be something like
source -> map -> sink
where all of the connections are forwarding connections. Since there is only one task, and because Flink only uses one thread per task, in this case we are only using one thread per slot. So allocating anything more than one core per slot is a complete waste. And the task is probably i/o bound anyway.
At the other extreme, I've seen jobs that involve ~30 joins, the evaluation of one or more ML models, plus windowed aggregations, etc. You certainly want more than one CPU core handling each parallel slice of a job like that (and more than two, for that matter).
Typically most of the CPU effort goes into serialization and deserialization, especially with RocksDB. I would try to figure out, for every event, how many RocksDB state accesses, keyBy's, and rebalances are involved -- and provide enough cores that all of that ser/de can happen concurrently (if you care about maximizing throughput). For the simplest of jobs, one core can keep up. By the time to you get to something like a windowed join you may already be pushing the limits of what one core can keep up with -- depending on how fast your sources and sinks can go, and how careful you are not to waste resources.
Example: imagine you are choosing between a parallelism of 50 with 2 cores per slot, or a parallelism of 100 with 1 core per slot. In both cases the same resources are available -- which will perform better?
I would expect fewer slots with more cores per slot to perform somewhat better, in general, provided there are enough tasks/threads per slot to keep both cores busy (if the whole pipeline fits into one task this might not be true, though deserializers can also run in their own thread). With fewer slots you'll have more keys and key groups per slot, which will help to avoid data skew, and with fewer tasks, checkpointing (if enabled) will be a bit better behaved. Inter-process communication is also a little more likely to be able to take an optimized (in-memory) path.