Tags: dask, dask-distributed

Worker process still alive after 0 seconds, killing


I submit two Dask worker containers through my scheduler (PBS) like this:

#!/usr/bin/env bash

#PBS -N MyApp
#PBS -q my_queue
#PBS -l select=1:ncpus=1:mem=2GB
#PBS -l walltime=00:30:00
#PBS -m n

/.../bin/python -m distributed.cli.dask_worker tcp://scheduler:53815 --nanny --death-timeout 60
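Both workers are started by submitting this same script twice, e.g. with qsub (the worker.pbs filename is only an illustrative placeholder for the script above):

# submit the same PBS job script once per worker
qsub worker.pbs
qsub worker.pbs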

The first worker successfully connects to the scheduler:

distributed.nanny - INFO -         Start Nanny at: 'tcp://...:48652'
distributed.worker - INFO -       Start worker at:    tcp://...:33401
distributed.worker - INFO -          Listening to:    tcp://...:33401
distributed.worker - INFO -          dashboard at:          ...:54725
distributed.worker - INFO - Waiting to connect to:   tcp://...:48272
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                   1.86 GiB
distributed.worker - INFO -       Local Directory: /.../
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://...:48272
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://...:48652'
Terminated

(Signal 15 is expected: on Red Hat it simply means SIGTERM, because I terminated the container myself before it finished.)

The problem is with the second worker:

The worker's container is fine, but the worker never processes any Dask tasks.

The logs are as follows:

distributed.nanny - INFO -         Start Nanny at: 'tcp://...:51682'
distributed.nanny - INFO - Closing Nanny at 'tcp://...:51682'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/.../site-packages/distributed/nanny.py", line 338, in start
    response = await self.instantiate()
  File "/.../site-packages/distributed/nanny.py", line 407, in instantiate
    result = await asyncio.wait_for(
  File "/.../asyncio/tasks.py", line 468, in wait_for
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/.../asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/.../site-packages/distributed/core.py", line 269, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/.../asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/.../runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/.../runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/.../site-packages/distributed/cli/dask_worker.py", line 469, in <module>
    go()
  File "/.../site-packages/distributed/cli/dask_worker.py", line 465, in go
    main()
  File "/.../site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/.../site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/.../site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/.../site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/.../site-packages/distributed/cli/dask_worker.py", line 451, in main
    loop.run_sync(run)
  File "/.../site-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/.../site-packages/distributed/cli/dask_worker.py", line 445, in run
    await asyncio.gather(*nannies)
  File "/.../asyncio/tasks.py", line 691, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/.../site-packages/distributed/core.py", line 273, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 240 seconds

As you can see, the second worker never starts listening; it only does nanny-related things.

Do you have any idea why the second worker never comes up?

Thank you

Edit:

I get the same errors with HTCondor:

distributed.nanny - INFO -         Start Nanny at: 'tcp://10.5.230.211:22967'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.5.230.211:22967'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/site-packages/distributed/nanny.py", line 338, in start
    response = await self.instantiate()
  File "/site-packages/distributed/nanny.py", line 407, in instantiate
    result = await asyncio.wait_for(
  File "/asyncio/tasks.py", line 466, in wait_for
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/asyncio/tasks.py", line 490, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/site-packages/distributed/core.py", line 269, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/asyncio/tasks.py", line 492, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/site-packages/distributed/cli/dask_worker.py", line 469, in <module>
    go()
  File "/site-packages/distributed/cli/dask_worker.py", line 465, in go
    main()
  File "/site-packages/click/core.py", line 1126, in __call__
    return self.main(*args, **kwargs)
  File "/site-packages/click/core.py", line 1051, in main
    rv = self.invoke(ctx)
  File "/site-packages/click/core.py", line 1393, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/site-packages/click/core.py", line 752, in invoke
    return __callback(*args, **kwargs)
  File "/site-packages/distributed/cli/dask_worker.py", line 451, in main
    loop.run_sync(run)
  File "/site-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/site-packages/distributed/cli/dask_worker.py", line 445, in run
    await asyncio.gather(*nannies)
  File "/asyncio/tasks.py", line 688, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/site-packages/distributed/core.py", line 273, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds

Solution

  • It works with the --no-dashboard option passed to every dask-worker (see the adjusted launch line below):

    https://github.com/dask/dask-jobqueue/issues/391#issuecomment-639257428
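In other words, the only change to the PBS script from the question is appending --no-dashboard to the worker command; everything else (paths, scheduler address) stays as it was:

# same launch line as before, with the worker dashboard disabled
/.../bin/python -m distributed.cli.dask_worker tcp://scheduler:53815 --nanny --death-timeout 60 --no-dashboard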