Search code examples
apache-flinksystem-design

Flink's TaskManager configuration


Wondering what the optimal configuration for task manager to cpu to slot per TM ratio would be and what are the pros and cons.

Say for a job with low compute requirements but extremely high data loads, would it be beneficial for a job manager to manage low amount of task managers, but each task manager having a high cpu count and high slot count? Or would it be better for many low cpu task managers with low slot count to exist instead?

I don't see a scenario where low cpu count high task manager count would be beneficial as it would simply incur more resource usage for the job manager, so I'm wondering what's the common sense design here for Flink tasks. Would it always be better to have a low amount of task managers but each with a high CPU and slot count?


Solution

  • My general rule of thumb is:

    1. One slot per CPU core (good starting point)
    2. Lots of CPU cores per server (fewer servers to configure)

    Of course item #2 doesn't matter so much this days when everything is a K8S pod. You still get slightly less network traffic when you have more slots/server, since inter-slot traffic within a server doesn't go through the network stack.