Tags: amazon-web-services, kubernetes, amazon-eks

Best Practices for Monitoring EKS Cluster with Prometheus and Grafana


I have recently set up Prometheus and Grafana for monitoring an EKS cluster, and I want to follow best practices for reliability and data persistence. The EKS control plane comes with a 99.95% SLA, so control-plane crashes are not my main concern; instead, I'm trying to address the following:

Data Persistence:

In case of node or cluster crashes, is there a risk of losing Prometheus and Grafana data? How can I ensure data persistence in such scenarios? Also, is it considered best practice to run the monitoring stack inside the cluster it monitors, or outside of it?

AWS Managed Node Groups:

I am considering creating a dedicated node group for Prometheus and Grafana pods using AWS Managed Node Groups. Does this align with best practices, or are there alternative approaches recommended for running monitoring tools?
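
For context, this is roughly the dedicated managed node group I have in mind, sketched as an eksctl config; the cluster name, region, instance type, sizes, and AZs are placeholders:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster          # placeholder: existing EKS cluster name
  region: us-east-1         # placeholder region
managedNodeGroups:
  - name: monitoring        # node group dedicated to monitoring workloads
    instanceType: m5.large
    minSize: 2
    maxSize: 3
    desiredCapacity: 2
    availabilityZones: ["us-east-1a", "us-east-1b"]  # spread across AZs
    labels:
      workload: monitoring
    taints:
      - key: dedicated
        value: monitoring
        effect: NoSchedule  # keep other workloads off these nodes
```

I would create it with something like `eksctl create nodegroup --config-file=nodegroup.yaml`, and the Prometheus and Grafana pods would then need a matching nodeSelector and toleration to be scheduled onto these nodes.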

High Availability:

To ensure high availability of monitoring services, what practices should be in place? Are there considerations for multi-AZ deployments or strategies to minimize potential downtime?

I want to implement the best practices to ensure the reliability and resilience of my monitoring setup. Any insights or recommendations from experienced EKS users would be greatly appreciated.


Solution

  • The prometheus-community Helm chart (https://prometheus-community.github.io/helm-charts/) is a great option for collecting metrics from an EKS cluster. It deploys a prometheus-server instance together with a node-exporter DaemonSet, which ensures one exporter pod runs on each EKS node to collect metrics from that node. The chart also starts other components (for example alertmanager and kube-state-metrics) alongside these.

    This Helm chart also creates two persistent volumes (PVs): one for the metrics stored by prometheus-server and the other for the alertmanager pod. These volumes live outside your EKS cluster nodes (typically EBS-backed), so even if a node in the cluster goes down, the metrics persist on the PV (see the values sketch at the end of this answer).

    The problem with keeping Grafana within the EKS cluster is that if a node goes down for some reason, it can take your Grafana dashboards down with it. Keeping Grafana outside your EKS cluster may be a better option: even if the cluster nodes go down, you can still access Grafana and see that something is wrong.

    As you already mentioned, multi-AZ deployments are a good choice for high availability.
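
    As a rough sketch, persistence for this chart is controlled through its values; the exact keys vary a bit between chart versions, and the storage class name below is an assumption (an EBS-backed gp3 class provisioned by the EBS CSI driver):

```yaml
# values.yaml for the prometheus-community/prometheus chart (sketch only;
# key names can differ between chart versions)
server:
  retention: 15d
  persistentVolume:
    enabled: true        # metrics survive pod/node loss on an EBS-backed PV
    size: 50Gi
    storageClass: gp3    # assumed StorageClass name
alertmanager:
  # newer chart versions expose the alertmanager subchart's `persistence` block;
  # older versions use `alertmanager.persistentVolume.*` instead
  persistence:
    enabled: true
    size: 2Gi
```

    Applying it with something like `helm upgrade --install prometheus prometheus-community/prometheus -n monitoring -f values.yaml` keeps these settings versioned alongside the rest of your configuration.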