I'm using Flink 1.18.1 with Flink operator 1.7.
The startup time (from pod creation to RUNNING state) is around 3 minutes. With 2 JMs, if the leader is killed or restarts, the job takes around 1:45 minutes to start again.
This was acceptable so far, but I'm now running a fairly low-latency job that needs startup to be snappier. Is there any way to improve the startup time of Flink deployments?
What I use today:
I checked the application logs, but nothing stood out and I can't see any sign of a bottleneck there.
So I found out what was happening. Flink normally used less than 0.05 vCPUs, but during startup it needed more CPU to load the state, some extensions, and some tracing as well.
The solution was to set the CPU limit factor for both the JM and TM, so that the CPU limit is a multiple of the CPU request. When the CPU request is less than 10% of a CPU, I configure Flink with a factor of 50x, and the factor decreases as the requested CPU increases.
This changed the startup time from 1:45 min (9 min worst case) to around 20 s (45 s worst case).
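
For anyone wanting to reproduce the tiered factor, here is a minimal sketch of how it could be computed. Only the first tier (request below 0.1 vCPU gives 50x) comes from my setup as described above; the remaining tiers, the default, and the function names are hypothetical placeholders you would tune yourself. The two config keys are Flink's `kubernetes.jobmanager.cpu.limit-factor` and `kubernetes.taskmanager.cpu.limit-factor`, which with the operator would go under `spec.flinkConfiguration`.

```python
# Tiers of (exclusive upper bound on the CPU request in vCPUs, limit factor).
# Only the first tier is taken from the post; the others are assumed values.
LIMIT_FACTOR_TIERS = [
    (0.1, 50.0),  # from the post: request < 10% of a CPU -> 50x
    (0.5, 10.0),  # assumption, for illustration only
    (1.0, 4.0),   # assumption, for illustration only
]
DEFAULT_FACTOR = 2.0  # assumption: factor for requests of 1 vCPU or more


def cpu_limit_factor(cpu_request):
    """Return a limit factor that shrinks as the CPU request grows."""
    for upper_bound, factor in LIMIT_FACTOR_TIERS:
        if cpu_request < upper_bound:
            return factor
    return DEFAULT_FACTOR


def flink_cpu_config(jm_cpu_request, tm_cpu_request):
    """Build the Flink configuration entries for the JM and TM limit factors."""
    return {
        "kubernetes.jobmanager.cpu.limit-factor": str(cpu_limit_factor(jm_cpu_request)),
        "kubernetes.taskmanager.cpu.limit-factor": str(cpu_limit_factor(tm_cpu_request)),
    }


if __name__ == "__main__":
    # Example: a tiny JM request and a larger TM request.
    for key, value in flink_cpu_config(0.05, 2.0).items():
        print(f"{key}: {value}")
```

The resulting CPU limit is the request multiplied by the factor, so a 0.05 vCPU request with a 50x factor lets the pod burst up to 2.5 vCPUs during startup while still only reserving 0.05 vCPUs on the node.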