I submit two Dask worker containers through my job scheduler (PBS) like this:
#!/usr/bin/env bash
#PBS -N MyApp
#PBS -q my_queue
#PBS -l select=1:ncpus=1:mem=2GB
#PBS -l walltime=00:30:00
#PBS -m n
/.../bin/python -m distributed.cli.dask_worker tcp://scheduler:53815 --nanny --death-timeout 60
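For context, each worker runs in its own PBS job. A sketch of how I submit them (the script name worker.sh is only an illustration, not the real file name):

```shell
# Submit the same job script twice so that two dask-worker processes start,
# both pointing at the same scheduler address (tcp://scheduler:53815).
qsub worker.sh   # first worker: connects fine
qsub worker.sh   # second worker: nanny times out (see logs below)
```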
The first worker successfully connects to the scheduler:
distributed.nanny - INFO - Start Nanny at: 'tcp://...:48652'
distributed.worker - INFO - Start worker at: tcp://...:33401
distributed.worker - INFO - Listening to: tcp://...:33401
distributed.worker - INFO - dashboard at: ...:54725
distributed.worker - INFO - Waiting to connect to: tcp://...:48272
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 1.86 GiB
distributed.worker - INFO - Local Directory: /.../
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://...:48272
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://...:48652'
Terminated
(Signal 15 is expected: on Red Hat it is just a SIGTERM, because I terminated the container myself before it finished.)
The problem is with the second worker: its container runs fine, but the worker never processes any Dask tasks.
Its logs are as follows:
distributed.nanny - INFO - Start Nanny at: 'tcp://...:51682'
distributed.nanny - INFO - Closing Nanny at 'tcp://...:51682'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/.../site-packages/distributed/nanny.py", line 338, in start
response = await self.instantiate()
File "/.../site-packages/distributed/nanny.py", line 407, in instantiate
result = await asyncio.wait_for(
File "/.../asyncio/tasks.py", line 468, in wait_for
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/.../asyncio/tasks.py", line 492, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/.../site-packages/distributed/core.py", line 269, in _
await asyncio.wait_for(self.start(), timeout=timeout)
File "/.../asyncio/tasks.py", line 494, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/.../runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/.../runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/.../site-packages/distributed/cli/dask_worker.py", line 469, in <module>
go()
File "/.../site-packages/distributed/cli/dask_worker.py", line 465, in go
main()
File "/.../site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/.../site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/.../site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/.../site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/.../site-packages/distributed/cli/dask_worker.py", line 451, in main
loop.run_sync(run)
File "/.../site-packages/tornado/ioloop.py", line 530, in run_sync
return future_cell[0].result()
File "/.../site-packages/distributed/cli/dask_worker.py", line 445, in run
await asyncio.gather(*nannies)
File "/.../asyncio/tasks.py", line 691, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/.../site-packages/distributed/core.py", line 273, in _
raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 240 seconds
As you can see, the second worker never seems to start listening; it only does nanny-related things.
Do you have any idea why the second worker never starts up?
Thank you
Edit: I get the same errors with HTCondor:
distributed.nanny - INFO - Start Nanny at: 'tcp://10.5.230.211:22967'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.5.230.211:22967'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/site-packages/distributed/nanny.py", line 338, in start
response = await self.instantiate()
File "/site-packages/distributed/nanny.py", line 407, in instantiate
result = await asyncio.wait_for(
File "/asyncio/tasks.py", line 466, in wait_for
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/asyncio/tasks.py", line 490, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/site-packages/distributed/core.py", line 269, in _
await asyncio.wait_for(self.start(), timeout=timeout)
File "/asyncio/tasks.py", line 492, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/site-packages/distributed/cli/dask_worker.py", line 469, in <module>
go()
File "/site-packages/distributed/cli/dask_worker.py", line 465, in go
main()
File "/site-packages/click/core.py", line 1126, in __call__
return self.main(*args, **kwargs)
File "/site-packages/click/core.py", line 1051, in main
rv = self.invoke(ctx)
File "/site-packages/click/core.py", line 1393, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/site-packages/click/core.py", line 752, in invoke
return __callback(*args, **kwargs)
File "/site-packages/distributed/cli/dask_worker.py", line 451, in main
loop.run_sync(run)
File "/site-packages/tornado/ioloop.py", line 530, in run_sync
return future_cell[0].result()
File "/site-packages/distributed/cli/dask_worker.py", line 445, in run
await asyncio.gather(*nannies)
File "/asyncio/tasks.py", line 688, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/site-packages/distributed/core.py", line 273, in _
raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds
It works when the --no-dashboard option is passed to every dask-worker, as suggested in https://github.com/dask/dask-jobqueue/issues/391#issuecomment-639257428
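Applied to the original command, the workaround looks like this (same scheduler address as before; --no-dashboard disables the per-worker dashboard, which in my setup avoids the nanny start-up timeout):

```shell
# Same worker launch as in the job script above, but with the
# dashboard disabled; with this flag both workers register and run tasks.
/.../bin/python -m distributed.cli.dask_worker tcp://scheduler:53815 \
    --nanny --death-timeout 60 --no-dashboard
```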