First of all, this question is quite vague. Below is what I would treat as the minimum set of monitors, based on my own preference.
Server metrics
- You should monitor the availability and latency of every API on every service, and of the persistence APIs.
- You should monitor queue latency from the history service. This is the key metric for understanding background task performance, which API availability and latency do not capture.
- You should build a dashboard of API counters for each service so that you can see how the load changes over time.
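As a rough illustration of the availability and latency monitors above, here is a minimal Python sketch that evaluates one API over one time window. Everything in it is an assumption for illustration: the `ApiWindow` type, the `check_api` helper, and the thresholds (99% availability, 1000 ms p99) are hypothetical and not part of any Cadence API. In practice these checks usually live as alert rules in your metrics backend rather than in application code.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class ApiWindow:
    """Hypothetical per-API stats collected over one monitoring window."""
    requests: int
    errors: int
    latencies_ms: List[float] = field(default_factory=list)


def availability(window: ApiWindow) -> float:
    """Fraction of requests that did not fail (1.0 if there was no traffic)."""
    if window.requests == 0:
        return 1.0
    return 1.0 - window.errors / window.requests


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_api(name: str, window: ApiWindow,
              min_availability: float = 0.99,    # assumed threshold
              max_p99_ms: float = 1000.0) -> List[str]:  # assumed threshold
    """Return one alert message per violated threshold."""
    alerts = []
    avail = availability(window)
    if avail < min_availability:
        alerts.append(f"{name}: availability {avail:.3f} below {min_availability}")
    p99 = percentile(window.latencies_ms, 99)
    if p99 > max_p99_ms:
        alerts.append(f"{name}: p99 latency {p99:.0f} ms above {max_p99_ms:.0f} ms")
    return alerts
```

The same shape applies to queue latency from the history service: feed its latency samples into `percentile` and alert on a threshold that fits your workload.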
Client metrics
- You should monitor workflow failures/timeouts
- You should monitor activity task failures/timeouts
- You should monitor decision task failures/timeouts
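The client-side monitors above boil down to tracking the share of failed or timed-out outcomes per kind of work. Here is a minimal Python sketch of that idea. The `ClientOutcomes` class, the outcome labels, and the 5% threshold are all hypothetical names and values chosen for illustration; they are not the metric names emitted by the Cadence client.

```python
from collections import Counter
from typing import List

# Hypothetical labels, chosen for this sketch only.
TASK_KINDS = ("workflow", "activity", "decision")
OUTCOMES = ("completed", "failed", "timed_out")


class ClientOutcomes:
    """Counts outcomes per kind of work and reports the bad-outcome rate."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, kind: str, outcome: str) -> None:
        if kind not in TASK_KINDS or outcome not in OUTCOMES:
            raise ValueError(f"unknown kind/outcome: {kind}/{outcome}")
        self.counts[(kind, outcome)] += 1

    def bad_rate(self, kind: str) -> float:
        """Share of failed or timed-out outcomes (0.0 if nothing recorded)."""
        total = sum(self.counts[(kind, o)] for o in OUTCOMES)
        if total == 0:
            return 0.0
        bad = self.counts[(kind, "failed")] + self.counts[(kind, "timed_out")]
        return bad / total


def kinds_over_threshold(outcomes: ClientOutcomes,
                         threshold: float = 0.05) -> List[str]:  # assumed threshold
    """Return the kinds of work whose failure/timeout rate exceeds the threshold."""
    return [k for k in TASK_KINDS if outcomes.bad_rate(k) > threshold]
```

Keeping failures and timeouts as separate outcomes lets you alert on the combined rate while still being able to tell the two apart on a dashboard.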