cadence-workflow, uber-cadence

Cadence - Identifying important Operation metrics


I am collecting Cadence metrics and want to aggregate them by Operation.

  1. Which operations (the top 5, give or take) across all services should we focus on? Or:
  2. Is there a top 5 (give or take) for each individual service? If so, could you list them?

Thanks in advance.


Solution

  • The question is fairly broad, so what follows is simply the minimum set of monitors I would set up, based on my own preference.

    Server metrics

    • Monitor the availability and latency of every API on every service, and of the persistence API.
    • Monitor queue latency from the history service. This is the key metric for understanding background task performance, which API availability and latency do not capture.
    • Build a dashboard of API counters for each service so that you can see how the load changes over time.
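
As a rough illustration of the per-API aggregation described above, here is a minimal sketch that groups request samples by (service, operation) and derives a request count, availability, and p99 latency for each group. The `Sample` shape, the field names, and the nearest-rank percentile are assumptions made for this example; they are not Cadence's own metric names or export format, so adapt them to whatever your metrics backend provides.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One observed API call (hypothetical shape, not a Cadence type)."""
    service: str       # e.g. "frontend", "history", "matching"
    operation: str     # API name as reported by your metrics pipeline
    latency_ms: float
    failed: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Aggregate samples per (service, operation)."""
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample.service, sample.operation)].append(sample)

    summary = {}
    for key, group in groups.items():
        failures = sum(1 for s in group if s.failed)
        summary[key] = {
            "requests": len(group),  # the "API counter" for the dashboard
            "availability": 1 - failures / len(group),
            "p99_latency_ms": percentile([s.latency_ms for s in group], 99),
        }
    return summary
```

Sorting the resulting summary by request count or by lowest availability is one way to find the handful of operations worth watching first in your own deployment.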

    Client metrics

    • Monitor workflow failures and timeouts.
    • Monitor activity task failures and timeouts.
    • Monitor decision task failures and timeouts.
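
The client-side monitors above boil down to a failure/timeout ratio per task kind, compared against a threshold. The sketch below shows that calculation under stated assumptions: the `(kind, outcome)` event tuples, the outcome labels, and the threshold value are all hypothetical and stand in for whatever your client metrics actually emit.

```python
from collections import Counter

# Outcomes counted as unhealthy (hypothetical labels for this sketch).
BAD_OUTCOMES = ("failed", "timed_out")


def failure_ratios(events):
    """Return the failure/timeout ratio per kind.

    `events` is an iterable of (kind, outcome) pairs, where kind is
    something like "workflow", "activity", or "decision".
    """
    totals = Counter()
    bad = Counter()
    for kind, outcome in events:
        totals[kind] += 1
        if outcome in BAD_OUTCOMES:
            bad[kind] += 1
    return {kind: bad[kind] / totals[kind] for kind in totals}


def breaches(ratios, threshold):
    """Return the kinds whose ratio exceeds the alert threshold, sorted."""
    return sorted(kind for kind, ratio in ratios.items() if ratio > threshold)
```

For example, calling `breaches(failure_ratios(events), 0.05)` would list the task kinds for which more than 5% of observed events failed or timed out; the right threshold depends on your workload.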