
Got error "Waited for xxx due to client-side throttling, not priority and fairness" when deploying Kubeflow Training Operator


I have a local Kubernetes cluster created by Rancher Desktop. I am trying to deploy the Kubeflow Training Operator following the installation guide.

However, after deploying with

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0"

The Kubeflow Training Operator pod is in a CrashLoopBackOff state, with the following log:

➜ kubectl logs training-operator-xxx -n kubeflow

I0714 04:54:03.434723       1 request.go:682] Waited for 1.024840626s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/packages.operators.coreos.com/v1?timeout=32s
1.689310446978421e+09   INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": ":8080"}
I0714 04:54:14.225698       1 request.go:682] Waited for 1.047503167s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/node.k8s.io/v1?timeout=32s
I0714 04:54:24.275500       1 request.go:682] Waited for 1.948469293s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/artifact.apicur.io/v1alpha1?timeout=32s
I0714 04:54:34.325909       1 request.go:682] Waited for 2.849523377s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/operators.coreos.com/v1?timeout=32s
I0714 04:54:44.724674       1 request.go:682] Waited for 1.047644251s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/operators.coreos.com/v1?timeout=32s
I0714 04:54:54.774273       1 request.go:682] Waited for 1.947402376s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/elasticsearch.k8s.elastic.co/v1?timeout=32s

Any ideas? Thanks!


Solution

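  • It turns out the Kubeflow Training Operator pod needs additional startup time on its first initialization. In the logs, the health probe server on :8081 only starts after more than a minute of throttled API discovery requests, so presumably the container was being restarted before it got that far.

    We can fix this by patching the Deployment with a startupProbe that has a higher failureThreshold. Liveness and readiness checks do not begin until the startup probe has succeeded, which gives the operator time to finish starting.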

    Here is the working version:

    kubectl apply --kustomize=kubeflow-training-operator
    
    • kubeflow-training-operator/kustomization.yaml

      resources:
        - github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0
      patches:
        - path: training-operator-deployment-patch.yaml
      
    • kubeflow-training-operator/training-operator-deployment-patch.yaml

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: training-operator
      spec:
        template:
          spec:
            containers:
              - name: training-operator
                startupProbe:
                  httpGet:
                    path: /healthz
                    port: 8081
                  failureThreshold: 30
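
    As a rough guide to sizing the probe: the startup budget is approximately failureThreshold × periodSeconds. The patch above leaves periodSeconds unset, so the Kubernetes default of 10 seconds applies and the container gets up to about 30 × 10 = 300 seconds to answer /healthz. The variant below is only a sketch that spells this out; the explicit periodSeconds line is an addition for illustration and is not part of the original patch.

    ```yaml
    # Sketch: same Deployment patch with the probe period written out.
    # periodSeconds: 10 is the Kubernetes default, stated here only to make
    # the budget visible (30 failures x 10 s = up to ~300 s to start).
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: training-operator
    spec:
      template:
        spec:
          containers:
            - name: training-operator
              startupProbe:
                httpGet:
                  path: /healthz   # health endpoint seen in the operator logs
                  port: 8081
                periodSeconds: 10      # assumption: explicit copy of the default
                failureThreshold: 30   # raise this if startup takes longer
    ```

    If the cluster has many installed API groups and discovery takes longer, raising failureThreshold (or periodSeconds) extends the budget accordingly.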
      

    After applying this, the operator starts properly:

    ➜ kubectl logs training-operator-xxx -n kubeflow
    
    I0714 05:41:03.499300       1 request.go:682] Waited for 1.041055501s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s
    1.6893132670509896e+09  INFO  controller-runtime.metrics  Metrics server is starting to listen  {"addr": ":8080"}
    I0714 05:41:14.295944       1 request.go:682] Waited for 1.048299708s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/registry.apicur.io/v1?timeout=32s
    I0714 05:41:24.296220       1 request.go:682] Waited for 1.898473292s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/admissionregistration.k8s.io/v1?timeout=32s
    I0714 05:41:34.345574       1 request.go:682] Waited for 2.797829418s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/kafka.strimzi.io/v1beta1?timeout=32s
    I0714 05:41:44.795474       1 request.go:682] Waited for 1.048790793s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/operators.coreos.com/v1alpha2?timeout=32s
    I0714 05:41:54.845438       1 request.go:682] Waited for 1.945571376s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/monitoring.coreos.com/v1alpha1?timeout=32s
    I0714 05:42:04.846740       1 request.go:682] Waited for 2.798720251s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/apm.k8s.elastic.co/v1beta1?timeout=32s
    I0714 05:42:15.295114       1 request.go:682] Waited for 1.047853292s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/node.k8s.io/v1?timeout=32s
    1.689313340247459e+09 INFO  setup starting manager
    1.6893133402522147e+09  INFO  Starting server {"kind": "health probe", "addr": "[::]:8081"}
    1.6893133402523057e+09  INFO  Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
    1.6893133402535763e+09  INFO  Starting EventSource  {"controller": "paddlejob-controller", "source": "kind source: *v1.PaddleJob"}
    1.6893133402535298e+09  INFO  Starting EventSource  {"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
    1.6893133402539213e+09  INFO  Starting EventSource  {"controller": "paddlejob-controller", "source": "kind source: *v1.Pod"}
    1.6893133402539287e+09  INFO  Starting EventSource  {"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
    1.6893133402534788e+09  INFO  Starting EventSource  {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
    1.689313340253958e+09 INFO  Starting EventSource  {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
    1.6893133402535348e+09  INFO  Starting EventSource  {"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
    1.6893133402540202e+09  INFO  Starting EventSource  {"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
    1.6893133402534842e+09  INFO  Starting EventSource  {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
    1.6893133402540386e+09  INFO  Starting EventSource  {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
    1.6893133402540817e+09  INFO  Starting EventSource  {"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
    1.6893133402540936e+09  INFO  Starting Controller {"controller": "pytorchjob-controller"}
    1.6893133402541058e+09  INFO  Starting EventSource  {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
    1.689313340254115e+09 INFO  Starting Controller {"controller": "tfjob-controller"}
    1.6893133402534952e+09  INFO  Starting EventSource  {"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
    1.6893133402541409e+09  INFO  Starting EventSource  {"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
    1.6893133402541444e+09  INFO  Starting EventSource  {"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
    1.6893133402542117e+09  INFO  Starting EventSource  {"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
    1.6893133402542229e+09  INFO  Starting Controller {"controller": "mxjob-controller"}
    1.6893133402542326e+09  INFO  Starting EventSource  {"controller": "paddlejob-controller", "source": "kind source: *v1.Service"}
    1.6893133402542348e+09  INFO  Starting Controller {"controller": "paddlejob-controller"}
    1.6893133402542171e+09  INFO  Starting EventSource  {"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
    1.6893133402542467e+09  INFO  Starting EventSource  {"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
    1.6893133402542505e+09  INFO  Starting EventSource  {"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
    1.6893133402542531e+09  INFO  Starting Controller {"controller": "mpijob-controller"}
    1.689313340254083e+09 INFO  Starting EventSource  {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
    1.6893133402546306e+09  INFO  Starting Controller {"controller": "xgboostjob-controller"}
    1.6893133403579748e+09  INFO  Starting workers  {"controller": "paddlejob-controller", "worker count": 1}
    1.6893133403599951e+09  INFO  Starting workers  {"controller": "xgboostjob-controller", "worker count": 1}
    1.6893133403601058e+09  INFO  Starting workers  {"controller": "pytorchjob-controller", "worker count": 1}
    1.6893133403601074e+09  INFO  Starting workers  {"controller": "mxjob-controller", "worker count": 1}
    1.6893133403601336e+09  INFO  Starting workers  {"controller": "tfjob-controller", "worker count": 1}
    1.6893133403601432e+09  INFO  Starting workers  {"controller": "mpijob-controller", "worker count": 1}
    

    References: