kubernetes-helm, flux, fluxcd

Is there a FluxCD equivalent to "argocd app wait" or "helm upgrade --wait"?


I did the following to deploy a helm chart (you can copy-and-paste my sequence of commands to reproduce this error).

$ flux --version
flux version 0.16.1

$ kubectl create ns traefik

$ flux create source helm traefik --url https://helm.traefik.io/traefik --namespace traefik
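
For reference, adding --export to the same flux create source command prints roughly the following HelmRepository manifest (the interval shown is, as far as I recall, the CLI default, so treat it as an assumption):

apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: traefik
  namespace: traefik
spec:
  interval: 1m0s
  url: https://helm.traefik.io/traefik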

$ cat values-6666.yaml
ports:
  traefik:
    healthchecksPort: 6666   # !!! Deliberately wrong port number!!!

$ flux create helmrelease my-traefik --chart traefik --source HelmRepository/traefik --chart-version 9.18.2 --namespace traefik --values=./values-6666.yaml
✚ generating HelmRelease
► applying HelmRelease
✔ HelmRelease created
◎ waiting for HelmRelease reconciliation
✔ HelmRelease my-traefik is ready
✔ applied revision 9.18.2

So Flux reports it as a success, which can be confirmed like this:

$ flux get helmrelease --namespace traefik
NAME        READY   MESSAGE                             REVISION    SUSPENDED
my-traefik  True    Release reconciliation succeeded    9.18.2      False
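
The same Ready condition can also be blocked on at the kubectl level, which is the closest generic analogue to argocd app wait that I am aware of; here it returns immediately because the condition is (misleadingly) already True:

$ kubectl -n traefik wait helmrelease/my-traefik --for=condition=Ready --timeout=5m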

But in fact, as shown earlier, values-6666.yaml contains a deliberately wrong port number (6666) for the pod's readiness probe (as well as its liveness probe), so the pod never becomes healthy:

$ kubectl -n traefik describe pod my-traefik-8488cc49b8-qf5zz
  ...
  Type     Reason    ... From     Message
  ----     ------    ... ----     -------
  Warning  Unhealthy ... kubelet  Liveness  probe failed: Get "http://172.31.61.133:6666/ping": dial tcp 172.31.61.133:6666: connect: connection refused
  Warning  Unhealthy ... kubelet  Readiness probe failed: Get "http://172.31.61.133:6666/ping": dial tcp 172.31.61.133:6666: connect: connection refused
  Warning  BackOff   ... kubelet  Back-off restarting failed container
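
(Independently of Flux, the stuck rollout can be confirmed with kubectl; the Deployment name my-traefik is inferred from the pod name above. This command never sees an available replica and eventually times out with a non-zero exit code:)

$ kubectl -n traefik rollout status deployment/my-traefik --timeout=2m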

My goal is to have FluxCD automatically detect the above error. But, as shown above, FluxCD deems it a success.

Either of the following deployment methods would have detected that failure:

$ helm upgrade --wait ...

or

$ argocd app sync ... && argocd app wait ...
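
(Spelled out for this chart, the Helm variant I have in mind would be roughly the following; the repo alias, release name and timeout are my own choices, mirroring the Flux setup above:)

$ helm repo add traefik https://helm.traefik.io/traefik
$ helm upgrade --install my-traefik traefik/traefik \
    --namespace traefik --version 9.18.2 \
    --values ./values-6666.yaml --wait --timeout 5m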

So, is there something similar in FluxCD to achieve the same effect?

====================================================================

P.S. The Flux docs here seem to suggest that the equivalent of helm --wait is already the default behaviour in FluxCD. My test above shows that it isn't. Furthermore, in the following example I explicitly set disableWait: false, but the result is the same.

$ cat helmrelease.yaml
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-traefik
  namespace: traefik
spec:
  chart:
    spec:
      chart: traefik
      sourceRef:
        kind: HelmRepository
        name: traefik
      version: 9.18.2
  install:
    disableWait: false      # !!! Explicitly set this flag !!!
  interval: 1m0s
  values:
    ports:
      traefik:
        healthchecksPort: 6666

$ kubectl -n traefik create -f helmrelease.yaml
helmrelease.helm.toolkit.fluxcd.io/my-traefik created

  ## Again, Flux deems it a success:
$ flux get hr -n traefik
NAME        READY   MESSAGE                             REVISION    SUSPENDED
my-traefik  True    Release reconciliation succeeded    9.18.2      False

  ## Again, the pod actually failed:
$ kubectl -n traefik describe pod my-traefik-8488cc49b8-bmxnv
... // Same error as earlier
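
For the record, the full condition list on the HelmRelease can be dumped like this; it merely mirrors the READY=True summary above, so I omit the output:

$ kubectl -n traefik get helmrelease my-traefik -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'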

Solution

  • Helm considers a Deployment with one replica and a rollingUpdate strategy with maxUnavailable: 1 to be ready as soon as it has been deployed, even though its single pod is unavailable. If you test Helm itself, I believe you will find the same behavior in the Helm CLI / Helm SDK package upstream.

    (Even if the deployment's one and only pod has entered CrashLoopBackOff and readiness and liveness checks have all failed... with maxUnavailable of 1 and replicas of 1, the deployment technically has no more than the allowed number of unavailable pods, so it is considered ready.)
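
    To illustrate the mechanism with a generic Deployment strategy snippet (this is not necessarily a value the traefik chart exposes): with maxUnavailable: 0, Helm's readiness check needs to see one ready replica before it reports the release as ready, so helm --wait (and Flux's wait, which relies on the same Helm SDK logic) would keep waiting on the failing probes instead of succeeding:

    spec:
      replicas: 1
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0   # 0 unavailable allowed => 1 ready replica required
          maxSurge: 1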

    This question was re-raised recently at https://github.com/fluxcd/helm-controller/issues/355, and I provided more in-depth feedback there.

    Anyway, as for the source of this behavior, which is clearly not what the user wanted (even if it is arguably exactly what the user asked for):

    On the Helm side, this appears to be the same issue reported on GitHub here: