kubernetes, metrics, kubernetes-go-client

How can I get Prometheus metrics for informer watch operation latency?


I use the watch mechanism to monitor resources in another cluster, such as Pods under a specific namespace. I want to promptly detect the health status of the watch connection, including connection latency and whether the connection has been disconnected. I noticed that error logs are generated within 30 seconds after disconnection, but I haven't found any relevant metrics that can be collected.

I want to expose metrics for the latency of the watch operation. After reading the client-go and component-base source code, I still couldn't figure out how to do it (solutions using other pre-packaged libraries are fine for me as well). Alternatively, is there a way for me to directly monitor the latency to the target cluster?

My first attempt was to record an observation in the event handler:

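informer.Informer().AddEventHandler(cache.FilteringResourceEventHandler{
    FilterFunc: func(obj interface{}) bool {
        // Update last synced time.
        UpdateLastSyncedTime()
        return true
    },
    // A non-nil Handler is required, because accepted events are forwarded to it.
    Handler: cache.ResourceEventHandlerFuncs{},
})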

But after some discussion I found this might not be a good approach. The update function is triggered only by informer events, which originate from changes to objects in the cluster or from periodic resyncs. For this approach to work, I would have to shorten the resync interval, which could be inefficient.


Solution

  • At the top level of your program, when you call kubernetes.NewForConfig(), it takes a *rest.Config as a parameter. Typical client programs just pass on the configuration from a factory function that reads a kubeconfig file, but you can modify it. In particular, (*rest.Config).Wrap() lets you inject your own wrapper object that runs at the HTTP layer.

    If you wanted to, for example, time an HTTP call, then you could write

    type HttpTimer struct {
      rt http.RoundTripper
    }
    
    // RoundTrip times the wrapped call and passes its result through unchanged.
    func (ht HttpTimer) RoundTrip(req *http.Request) (*http.Response, error) {
      before := time.Now()
      resp, err := ht.rt.RoundTrip(req)
      duration := time.Since(before)
      // handle or report errors; retry; report duration; ...
      _ = duration // placeholder so this compiles until the duration is reported
      return resp, err
    }
    
    loadingRules := clientcmd.NewDefaultClientConfigLoadingRules()
    configOverrides := &clientcmd.ConfigOverrides{}
    kubeConfig := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(loadingRules, configOverrides)
    restConfig, err := kubeConfig.ClientConfig()
    if err != nil {
      // handle the error
    }
    
    // add to top-level setup
    restConfig.Wrap(func (rt http.RoundTripper) http.RoundTripper {
      return HttpTimer{rt}
    })
    
    clientSet, err := kubernetes.NewForConfig(restConfig)
    if err != nil {
      // handle the error
    }
    

    Since this hooks into the HTTP layer, it might not give you exactly what you're looking for. Informers are built on the Kubernetes list and watch APIs, and a watch is a long-running request that returns a stream of events. That means you might see a single HTTP request, lasting several minutes, that actually delivers dozens of events. It also means that "latency" as such isn't a well-defined concept for a watch: RoundTrip() returns once the response headers arrive, so the measured duration covers only setting up the stream, not the events that follow. This would still work to measure the round-trip time for individual .Get() and .Update() operations, though.