apache-flink flink-streaming startup

Flink standalone mode takes too long to start


I'm using Flink 1.18.1 with Flink operator 1.7.

The startup time (from pod creation to RUNNING state) is around 3 minutes. If I run 2 JobManagers (JM) and the leader is killed or restarts, the job takes around 1:45 min to start.

This has been acceptable so far, but I'm now running a fairly low-latency job that needs startup to be snappier. Is there a way to improve the startup time of Flink deployments?

What I use today:

  • Standalone mode
  • Kafka as source
  • HA k8s enabled
  • GCS as external storage system (checkpoints and savepoints)
  • k8s Flink operator

I can't find any obvious problem in the logs, though.

I checked the application logs, but nothing stood out.


Solution

  • I found out what was happening. Flink was using less than 0.05 vCPUs while running, but during startup it needed more CPU to load the state, some extensions, and some tracing as well.

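    The solution was to set the CPU limit factor for the JM and TM, so that the CPU limit is a multiple of the CPU request and the pods can burst above their request during startup. When the CPU request is less than 10% of a CPU, I set the factor to 50x, and I lower the factor as the requested CPU increases.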

    This changed the startup time from 1:45 min (9 min in the worst case) to around 20s (45s in the worst case).
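
The tiering described above could be sketched roughly as follows. This is only an illustration, not the exact setup from the answer: the 50x factor for requests below 0.1 vCPU comes from the answer, while the other tiers (10x, 4x, 2x) and the helper names `cpu_limit_factor` and `flink_cpu_limit_config` are made up for the example. The two configuration keys are, as far as I know, the standard Flink options for the JM and TM CPU limit factor, but check them against the documentation for your Flink and operator versions.

```python
def cpu_limit_factor(cpu_request):
    """Pick a CPU limit factor for a CPU request given in vCPUs.

    Only the first tier (request < 0.1 vCPU -> 50x) is taken from the answer.
    The remaining tiers are hypothetical values chosen for illustration.
    """
    if cpu_request <= 0:
        raise ValueError("cpu_request must be positive")
    if cpu_request < 0.1:
        return 50.0  # from the answer: tiny requests get a 50x limit
    if cpu_request < 0.5:
        return 10.0  # hypothetical tier
    if cpu_request < 1.0:
        return 4.0   # hypothetical tier
    return 2.0       # hypothetical tier for requests of 1 vCPU or more


def flink_cpu_limit_config(jm_cpu, tm_cpu):
    """Build Flink configuration entries for the JM and TM CPU limit factors.

    Kubernetes then sees a CPU limit of (request * factor) for each pod.
    """
    return {
        "kubernetes.jobmanager.cpu.limit-factor": str(cpu_limit_factor(jm_cpu)),
        "kubernetes.taskmanager.cpu.limit-factor": str(cpu_limit_factor(tm_cpu)),
    }


if __name__ == "__main__":
    # Example: a JM requesting 0.05 vCPU and a TM requesting 0.5 vCPU.
    for key, value in flink_cpu_limit_config(0.05, 0.5).items():
        print(f"{key}: {value}")
```

With the Flink Kubernetes operator, entries like these would presumably go under `spec.flinkConfiguration` of the `FlinkDeployment`. A high factor on a tiny request keeps the scheduled footprint small while still allowing a short CPU burst at startup, at the cost of possible contention on a busy node.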