
dask-kubernetes zero workers on GKE


Noob here. I want a Dask install with a worker pool that can grow and shrink based on current demand. I followed the Zero to JupyterHub instructions to install on GKE, and then went through the install instructions for dask-kubernetes: https://kubernetes.dask.org/en/latest/.

I originally ran into some permissions issues, so I created a service account with all permissions and changed my config.yaml to use it. That got rid of the permissions issues, but now when I run this script with the default worker-spec.yml, I get no workers:

from dask_kubernetes import KubeCluster
import distributed

cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.scale_up(4)  # request 4 workers explicitly

client = distributed.Client(cluster)
client
Cluster

    Workers: 0
    Cores: 0
    Memory: 0 B

When I list my pods, I see a lot of workers in the pending state:

patrick_mineault@cloudshell:~ (neuron-264716)$ kubectl get pod --namespace jhub
NAME                          READY   STATUS    RESTARTS   AGE
dask-jovyan-24034fcc-22qw7w   0/1     Pending   0          45m
dask-jovyan-24034fcc-25h89q   0/1     Pending   0          45m
dask-jovyan-24034fcc-2bpt25   0/1     Pending   0          45m
dask-jovyan-24034fcc-2dthg6   0/1     Pending   0          45m
dask-jovyan-25b11132-52rn6k   0/1     Pending   0          26m
...

And when I describe each pod, I see a scheduling failure caused by insufficient CPU and memory:

Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  69s (x22 over 30m)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.

Do I need to manually create a new autoscaling pool in GKE or something? I only have one pool now, the one that runs JupyterLab, and that pool is already fully committed. I can't figure out which piece of configuration tells Dask which pool to put the workers in.


Solution

  • I did indeed need to create a flexible, scalable node pool to host the workers. There's an example of this in the Pangeo setup guide: https://github.com/pangeo-data/pangeo/blob/master/gce/setup-guide/1_create_cluster.sh. This is the relevant line:

    gcloud container node-pools create worker-pool --zone=$ZONE --cluster=$CLUSTER_NAME \
        --machine-type=$WORKER_MACHINE_TYPE --preemptible --num-nodes=$MIN_WORKER_NODES
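
For anyone wondering why the pods sat in Pending in the first place: as far as I understand it, the scheduler only places a pod on a node when the pod's resource requests fit into what the node has left after the requests of the pods already running there. The sketch below is a simplified illustration of that check, not the scheduler's actual code. The `Resources` class, the `unmet_resources` helper, and all the numbers are made up for illustration; the real values come from `kubectl describe node` and from the requests in your worker-spec.yml.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_millicores: int  # 1000 millicores = 1 CPU
    memory_mib: int


def unmet_resources(allocatable, already_requested, pod_request):
    """Return the names of the resources for which the pod does not fit on the node."""
    shortfalls = []
    if already_requested.cpu_millicores + pod_request.cpu_millicores > allocatable.cpu_millicores:
        shortfalls.append("cpu")
    if already_requested.memory_mib + pod_request.memory_mib > allocatable.memory_mib:
        shortfalls.append("memory")
    return shortfalls


# Hypothetical numbers: a single small node that is mostly claimed by JupyterHub pods.
node = Resources(cpu_millicores=1930, memory_mib=5600)
in_use = Resources(cpu_millicores=1500, memory_mib=4000)
# Hypothetical worker request; check the requests in your own worker-spec.yml.
worker = Resources(cpu_millicores=2000, memory_mib=6144)

print(unmet_resources(node, in_use, worker))  # ['cpu', 'memory']
```

With only one node and both resources short, this corresponds to the "0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory" message from the question. A second pool with larger machines gives the scheduler nodes where the check passes.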