cadence-workflow uber-cadence

Cadence - Identifying important Operation metrics

I am collecting metrics and want to aggregate them by Operation.

What would you say are the top five or so operations across all services that we should focus on?
Or is there a top five or so for each individual service? If so, could you list them?

Thanks in advance.

Solution

First of all, the question is quite vague. What follows is the minimum set of monitors I would set up, based on my own preference.

Server metrics

You should monitor the availability and latency of every API for every service, as well as the persistence API.
You should monitor queue latency from the history service. This is the key metric for understanding background task performance, which API availability and latency do not capture.
You should build a dashboard of API counters for each service so that you can see how the load changes over time.
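
To make the per-operation aggregation concrete, here is a minimal sketch of grouping raw request samples by (service, operation) and deriving request count, availability, and p99 latency for each group. The `Sample` type, its field names, and the nearest-rank percentile are my own assumptions for illustration, not Cadence's metric schema; in practice your metrics backend would most likely do this grouping for you.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical shape of one recorded API call; not a Cadence type.
    service: str
    operation: str
    ok: bool
    latency_ms: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Group samples by (service, operation) and compute basic health stats."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s.service, s.operation)].append(s)

    summary = {}
    for key, group in groups.items():
        total = len(group)
        succeeded = sum(1 for s in group if s.ok)
        summary[key] = {
            "requests": total,  # the counter you would chart as load over time
            "availability": succeeded / total,
            "p99_latency_ms": percentile([s.latency_ms for s in group], 99),
        }
    return summary
```

Sorting the resulting dictionary by `requests` (or by lowest `availability`) is one way to arrive at a "top N operations" list for your own deployment, rather than relying on a fixed list.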

Client metrics

You should monitor workflow failures and timeouts.
You should monitor activity task failures and timeouts.
You should monitor decision task failures and timeouts.
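
As a rough sketch of how the three client-side monitors could be turned into alerts, the snippet below computes a combined failure-plus-timeout rate for workflows, activities, and decision tasks from a dictionary of counters and reports which ones exceed a threshold. The counter names (`workflow_total`, `activity_failed`, and so on) and the threshold are hypothetical placeholders; map them to whatever names your client metrics actually emit.

```python
KINDS = ("workflow", "activity", "decision")


def failure_rates(counters):
    """Return the (failed + timed out) / total ratio for each kind.

    `counters` uses hypothetical keys such as "workflow_total",
    "workflow_failed" and "workflow_timeout"; missing keys count as 0.
    """
    rates = {}
    for kind in KINDS:
        total = counters.get(f"{kind}_total", 0)
        bad = counters.get(f"{kind}_failed", 0) + counters.get(f"{kind}_timeout", 0)
        rates[kind] = bad / total if total else 0.0
    return rates


def breaches(rates, threshold):
    """List the kinds whose failure rate is strictly above the threshold."""
    return sorted(kind for kind, rate in rates.items() if rate > threshold)
```

For example, calling `breaches(failure_rates(counters), 0.05)` would return the kinds whose combined failure and timeout rate is above 5%; the right threshold depends on your workload.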