hadoop-yarn, dask

Dask-Yarn fails to allocate the requested number of workers


We have a CDH cluster (version 5.14.4) with 6 worker servers and a total of 384 vcores (64 cores per server).
We are running some ETL processes using dask version 2.8.1, dask-yarn version 0.8, and skein 0.8.
Currently we are having a problem allocating the maximum number of workers.
We are not able to run a job with more than 18 workers! (We can see the actual number of workers in the dask dashboard.)
The definition of the cluster is as follows:

from dask_yarn import YarnCluster

cluster = YarnCluster(environment='path/to/my/env.tar.gz',
                      n_workers=24,
                      worker_vcores=4,
                      worker_memory='64GB')
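
For reference, one way to confirm how many workers actually connected (besides the dashboard) is to ask the scheduler through a distributed Client. A minimal sketch, assuming the cluster object defined above:

from dask.distributed import Client

# Connect a client to the YARN-backed cluster and count the workers
# the scheduler actually sees.
client = Client(cluster)
print(len(client.scheduler_info()["workers"]))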

Even when increasing the number of workers to 50, nothing changes, although when changing worker_vcores or worker_memory we can see the changes in the dashboard.

Any suggestions?

Update

Following @jcrist's answer, I realized that I didn't fully understand the terminology mapping between the Yarn web UI application dashboard and the Yarn cluster parameters.

From my understanding:

  1. A Yarn container is equal to a dask worker.
  2. Whenever a Yarn cluster is generated, there are 2 additional containers running (one for the scheduler and one for a logger, each with 1 vCore) - see the sketch below for listing containers per service.
  3. The interplay between n_workers * worker_vcores and n_workers * worker_memory is something I still need to fully grok.
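
A hedged way to check point 2 is to list the application's containers and tally them per skein service. The service names "dask.scheduler" and "dask.worker" are what dask-yarn uses by convention; verify them (and the service_name attribute) against your own application spec and skein version:

from collections import Counter

# List every container in the application and count them per skein service.
# Service names here are the conventional dask-yarn ones; adjust if your
# application spec uses different names.
containers = cluster.application_client.get_containers()
print(Counter(c.service_name for c in containers))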

There is another issue - while optimizing I tried using cluster.adapt(). The cluster was running with 10 workers, each with 10 threads and a 100GB memory limit, but the Yarn web UI showed only 2 containers running (my cluster has 384 vCores and 1.9TB of memory, so there is still plenty of room to expand). This is probably worth a separate question.
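
For what it's worth, the adaptive run above used the default bounds; a minimal sketch of bounded adaptive scaling (the minimum/maximum keyword names follow the dask-yarn docs, so double-check them against your installed version):

# Adaptive scaling with explicit bounds, so the adaptive logic may request
# up to `maximum` worker containers as load grows.
# Keyword names assumed from dask-yarn's documented adapt(); verify locally.
cluster.adapt(minimum=2, maximum=20)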


Solution

  • There are many reasons why a job may be denied more containers. Do you have enough memory across your cluster to allocate that many 64 GiB chunks? Further, does 64 GiB tile evenly across your cluster nodes (a rough packing check is sketched at the end of this answer)? Is your YARN cluster configured to allow jobs that large in this queue? Are there competing jobs that are also taking resources?

    You can see the status of all containers using the ApplicationClient.get_containers method.

    >>> cluster.application_client.get_containers()
    

    You could filter on state REQUESTED to see just the pending containers:

    >>> cluster.application_client.get_containers(states=['REQUESTED'])
    

    This should give you some insight as to what's been requested but not yet allocated.

    If you suspect a bug in dask-yarn, feel free to file an issue (including logs from the application master for a problematic run), but I suspect this is more an issue with the size of containers you're requesting, and how your queue is configured/currently used.
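
    As a rough illustration of the tiling point above - a back-of-the-envelope sketch using the node count and sizes mentioned in the question; the per-node memory YARN is actually allowed to hand out (yarn.nodemanager.resource.memory-mb) is an assumption here and is the value you should check:

    # Back-of-the-envelope packing check using the figures from the question
    # (6 nodes, ~1.9 TB total memory, 64 cores per node). The per-node memory
    # available to YARN is assumed; check yarn.nodemanager.resource.memory-mb.
    nodes = 6
    node_memory_gib = 1900 / nodes          # ~316 GiB if all memory went to YARN
    worker_memory_gib = 64
    worker_vcores, node_vcores = 4, 64

    max_by_memory = nodes * int(node_memory_gib // worker_memory_gib)   # 24
    max_by_vcores = nodes * (node_vcores // worker_vcores)              # 96
    print(max_by_memory, max_by_vcores)

    # If YARN is only given ~200 GiB per node, the memory bound drops to
    # 6 * 3 = 18, which would line up with the observed cap of 18 workers.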