I have a task running on an EC2 cluster that slows down progressively once the virtual CPUs (hyperthreads) come into use, regardless of EBS volume size. To avoid this, I want to disable hyperthreading on all nodes, and I have been trying to follow the advice given here: https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-linux/.
I am using Ray to launch the cluster on Ubuntu 18.04, and assumed that the initialization_commands section of the config.yaml file is the appropriate place for the bash commands (the bootcmd: heading is not understood there). I have tried a number of different formats, but none of them works. For example:
# List of commands run before setup_commands.
initialization_commands:
- for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un); do echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done
This produces the following error:
bash: syntax error near unexpected token `sudo'
2020-07-26 22:53:04,949 INFO log_timer.py:17 -- NodeUpdater: i-0eefc0511ce029fb3: Initialization commands completed [LogTimer=139ms]
2020-07-26 22:53:04,949 INFO log_timer.py:17 -- NodeUpdater: i-0eefc0511ce029fb3: Applied config 39910e8bc12541ca5e316063231a2493642efee4 [LogTimer=60603ms]
2020-07-26 22:53:04,950 ERROR updater.py:348 -- NodeUpdater: i-0eefc0511ce029fb3: Error updating (Exit Status 1) ssh -i /home/haines/.ssh/ray-key2_us-east-1.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_98734ce2b6/5f5c61af53/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 ubuntu@3.93.77.73 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr '"'"','"'"' '"'"'\n'"'"' | sort -un); sudo echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done'
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 351, in run
raise e
File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 341, in run
self.do_update()
File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 426, in do_update
self.cmd_runner.run(cmd)
File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 263, in run
self.process_runner.check_call(final_cmd)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/haines/.ssh/ray-key2_us-east-1.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_98734ce2b6/5f5c61af53/%C', '-o', 'ControlPersist=10s', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', 'ubuntu@3.93.77.73', 'bash', '--login', '-c', '-i', '\'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr \'"\'"\',\'"\'"\' \'"\'"\'\\n\'"\'"\' | sort -un); sudo echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done\'']' returned non-zero exit status 1.
2020-07-26 22:53:05,018 INFO log_timer.py:17 -- AWSNodeProvider: Set tag ray-node-status=setting-up on ['i-0eefc0511ce029fb3'] [LogTimer=205ms]
2020-07-26 22:53:05,140 ERROR commands.py:285 -- get_or_create_head_node: Updating 3.93.77.73 failed
I have tried splitting the command across separate lines, and putting it in the setup_commands section instead, but neither works. Is there an easier way?
Update: I guess that the syntax error may be due to spacing or stray characters (though I have tried many variants), but even without the loop, i.e. with only the sudo echo command writing to a single CPU, I get a permission error:
bash: /sys/devices/system/cpu/cpu50/online: Permission denied
Update 2: I have found a simpler method, "export OMP_NUM_THREADS=1", but it seems to have no effect when run as a bash command in the setup. I am using Ray 0.8.6, which I think is supposed to set OMP_NUM_THREADS=1, but the variable isn't defined on the head node when the cluster is up and running.
Setting OMP_NUM_THREADS turned out to have no effect. The solution was the original one described by AWS, but it also required adding write permissions on all the CPU online flags, in the Ray configuration file:
setup_commands:
- sudo chmod -R 777 /sys/devices/system/cpu/*
- for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un); do echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done
With hyperthreading disabled, any number of tasks, up to one per physical CPU, runs in the same time as a single task. Of course, it also means that I have to run twice as many workers to get the same number of usable CPUs.
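As an alternative to making the whole /sys/devices/system/cpu tree world-writable, the write itself can be done as root. In `sudo echo 0 > file`, only echo runs under sudo; the redirection is opened by the calling, unprivileged shell, which is the usual cause of a "Permission denied" like the one in the question. Note also that the command shown in the error log has no `do` after the `for ...;` clause, which would explain the syntax error near `sudo`. The following is a minimal sketch, not tested on a Ray cluster, assuming the login user has passwordless sudo. The function names sibling_cpus and disable_siblings and the optional root-directory argument are hypothetical, added only so that the listing step can be tried against a scratch directory.

```shell
# Print the CPU numbers of the secondary hyperthread siblings, one per line.
# The optional argument (an assumption for dry runs) overrides the sysfs root.
sibling_cpus() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    cat "$cpu_root"/cpu*/topology/thread_siblings_list 2>/dev/null \
        | cut -s -d, -f2- | tr ',' '\n' | sort -un
}

# Take every secondary sibling offline.
# tee runs under sudo, so the write to the "online" flag is done as root;
# no chmod on the sysfs tree is needed.
disable_siblings() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    for cpunum in $(sibling_cpus "$cpu_root"); do
        echo 0 | sudo tee "$cpu_root/cpu$cpunum/online" > /dev/null
    done
}
```

Running sibling_cpus on its own only prints the CPU numbers that would be taken offline, so it is a harmless way to check the pipeline on a node first. In the Ray config, the same idea would mean keeping the original loop in setup_commands but replacing its body with the `echo 0 | sudo tee ... > /dev/null` form.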