Tags: dask, dask-distributed

Dask scaling issue: "Too many open files" when increasing the number of workers


I run an SSH cluster from the command line. Every node has 32 CPUs.

dask ssh --hostfile $PBS_NODEFILE --nworkers 32 --nthreads 1 &

The code:

from dask.distributed import Client, as_completed

# connect to the scheduler started by dask ssh (address taken from the logs below)
dask_client = Client("tcp://158.194.103.68:8786")

# items are individual molecules
# mol_dock is the function to process them (takes 1-20 min)
for future, res in as_completed(dask_client.map(mol_dock, items), with_results=True):
    ...  # process res

The mol_dock function runs an external command in a subprocess shell; the command takes two input files and creates an output JSON file, which mol_dock parses before returning the results.
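For context, mol_dock presumably looks something like the sketch below. The external command name, file names, and the layout of each item are hypothetical placeholders, not the actual implementation:

import json
import subprocess
import tempfile
from pathlib import Path


def mol_dock(item):
    # Hypothetical sketch: write the two input files, run the external
    # docking command in a subprocess shell, then parse and return its JSON output.
    with tempfile.TemporaryDirectory() as tmp:
        ligand = Path(tmp) / "ligand.pdbqt"      # placeholder file names
        receptor = Path(tmp) / "receptor.pdbqt"
        out = Path(tmp) / "result.json"
        ligand.write_text(item["ligand"])
        receptor.write_text(item["receptor"])
        subprocess.run(f"dock_cmd {ligand} {receptor} -o {out}",
                       shell=True, check=True)
        return json.loads(out.read_text())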

If I run the code on 14 nodes it works fine, but with more nodes it starts to produce "Too many open files" errors like the one below. This causes many calculations to fail and be restarted. Eventually all calculations finish successfully, but the overhead from the restarts is substantial.

[ scheduler 158.194.103.68:8786 ] : 2023-04-24 17:26:08,591 - tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 20 memory: 254 MB fds: 2048>>
[ scheduler 158.194.103.68:8786 ] : Traceback (most recent call last):
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_common.py", line 443, in wrapper
[ scheduler 158.194.103.68:8786 ] : KeyError: <function Process._parse_stat_file at 0x7f7f37502820>
[ scheduler 158.194.103.68:8786 ] :
[ scheduler 158.194.103.68:8786 ] : During handling of the above exception, another exception occurred:
[ scheduler 158.194.103.68:8786 ] :
[ scheduler 158.194.103.68:8786 ] : Traceback (most recent call last):
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/distributed/system_monitor.py", line 128, in update
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/__init__.py", line 999, in cpu_percent
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_pslinux.py", line 1645, in wrapper
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_pslinux.py", line 1836, in cpu_times
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_pslinux.py", line 1645, in wrapper
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_common.py", line 450, in wrapper
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_pslinux.py", line 1687, in _parse_stat_file
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_common.py", line 776, in bcat
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_common.py", line 764, in cat
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_common.py", line 728, in open_binary
[ scheduler 158.194.103.68:8786 ] : OSError: [Errno 24] Too many open files: '/proc/18692/stat'

We increased the soft and hard limits to 1,000,000, but this does not help. We did it by raising the limits in /etc/security/limits.conf for the particular user, as suggested in the FAQ.
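For reference, such entries in /etc/security/limits.conf typically look like the lines below; the user name is taken from the paths in the traceback and the value matches the one quoted above, but treat them as illustrative:

pavlop  soft  nofile  1000000
pavlop  hard  nofile  1000000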

It seems that I am missing something. Are there other tweaks or checks that would be reasonable to try? Are there other limits on the number of open files on Linux? The same behavior is observed with a soft limit of 100,000, so the setting seems to have no effect.
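One quick way to check whether the higher limit is actually in effect everywhere is to query the file-descriptor limit on the scheduler and on every worker. This is a sketch, not from the original post; the scheduler address is the one shown in the logs above:

import resource

from dask.distributed import Client


def nofile_limit():
    # (soft, hard) limit on open file descriptors for the calling process
    return resource.getrlimit(resource.RLIMIT_NOFILE)


dask_client = Client("tcp://158.194.103.68:8786")
print("scheduler:", dask_client.run_on_scheduler(nofile_limit))
print("workers:", dask_client.run(nofile_limit))  # dict keyed by worker address

If some of the reported limits are still at the default (often 1024), the new settings never reached those nodes.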


Solution

  • The issue was that the limits were increased only on the main node of the cluster, not on the individual compute nodes. After raising the limits on all nodes, everything started to work.