cadence-workflow uber-cadence

Is there a Cadence metric that can help spot overloads for each specific activity worker?


My company would like to automatically scale each activity worker and each workflow worker independently, according to the load on its tasklist.

Reading the docs, I found the following metrics for activity workers:

  • cadence_activity_scheduled_to_start_latency_bucket
  • cadence_activity_scheduled_to_start_latency_count
  • cadence_activity_scheduled_to_start_latency_sum

However, these seem to be global metrics covering all activity workers. Is there a Cadence metric that would allow me to spot overloads for each specific activity worker?

Example: we have 4 different activity workers: A, B, C and D. We would like to scale A, B, C or D independently, without impacting the others.


Solution

  • Understand scheduled_to_start_latency

    scheduled_to_start_latency measures the time from when a task is scheduled to when a worker starts it. During that interval, the task is transferred from the matching service to an activity worker.

    These are the potential hotspots when this latency is high:

  • How to monitor an activity worker for overload

    • CPU, memory, thread usage and garbage collection of the activity worker are usually enough to make sure a worker is not overloaded.
    • You can also use scheduled_to_start_latency, but high latency can mean different things, as noted above. Use other metrics to rule out the other causes.
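
The two checks above can be combined into a per-worker decision. The following is a minimal sketch, not a Cadence API: it assumes you can obtain the `_sum` and `_count` series of the latency histogram separately for each worker (for example A, B, C and D), and every name and threshold in it (`HistogramSample`, `WorkerStats`, `mean_latency`, `diagnose`, 1 second, 80%) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistogramSample:
    """Cumulative _sum and _count of a latency histogram at one scrape."""
    latency_sum_seconds: float
    count: float


@dataclass
class WorkerStats:
    """Resource usage of one activity worker, as fractions from 0.0 to 1.0."""
    cpu_utilization: float
    memory_utilization: float


def mean_latency(prev: HistogramSample, curr: HistogramSample) -> Optional[float]:
    """Mean latency between two scrapes, or None if no task started in between."""
    started = curr.count - prev.count
    if started <= 0:
        return None
    return (curr.latency_sum_seconds - prev.latency_sum_seconds) / started


def diagnose(
    latency_seconds: Optional[float],
    stats: WorkerStats,
    latency_threshold: float = 1.0,   # assumed threshold, tune for your workload
    resource_threshold: float = 0.8,  # assumed threshold, tune for your workload
) -> str:
    """Return "scale_up", "investigate" or "ok" for a single worker."""
    saturated = (
        stats.cpu_utilization >= resource_threshold
        or stats.memory_utilization >= resource_threshold
    )
    if saturated:
        # Resource usage alone is usually enough to call the worker overloaded.
        return "scale_up"
    if latency_seconds is not None and latency_seconds >= latency_threshold:
        # High latency without resource pressure: the cause may lie elsewhere,
        # so check other metrics before scaling this worker.
        return "investigate"
    return "ok"
```

Running `diagnose` once per worker keeps the scaling decisions independent: a saturated worker A yields "scale_up" without affecting the result for B, C or D.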