slurm

How can enroot share its image cache and data across multiple nodes?


I have multiple GPU nodes pooled through Slurm, and enroot.conf uses the default configuration. With this setup, an image pulled by enroot is cached only on the node that pulled it. When a task runs on another node, the image has to be pulled again, which wastes time.

enroot.conf:

#ENROOT_LIBRARY_PATH /usr/lib/enroot
#ENROOT_SYSCONF_PATH /etc/enroot
#ENROOT_RUNTIME_PATH ${XDG_RUNTIME_DIR}/enroot
#ENROOT_CONFIG_PATH ${XDG_CONFIG_HOME}/enroot
#ENROOT_CACHE_PATH ${XDG_CACHE_HOME}/enroot
#ENROOT_DATA_PATH ${XDG_DATA_HOME}/enroot
#ENROOT_TEMP_PATH ${TMPDIR:-/tmp}

I would like tasks submitted through Slurm to share the image cache regardless of which node they run on, so that less time is spent pulling images.


Solution

  • You can solve this problem by changing ENROOT_CACHE_PATH to point to a directory on storage that is shared across your compute nodes (a single common directory, if you also want to share the cache across users). You may also need to make sure that the directory exists. It is also worth inspecting your Slurm prolog/epilog scripts to check whether any additional changes are needed.
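
As a minimal sketch of the change described above, the enroot.conf on each compute node could uncomment the cache setting and point it at shared storage. The path /shared/enroot/cache is a hypothetical placeholder, not something from the question; substitute a directory on whatever filesystem is actually mounted on all of your nodes (for example an NFS or parallel-filesystem mount), and create it beforehand with permissions that suit the users who should share it.

```
# /shared/enroot/cache is an assumed example path on storage visible to every node
ENROOT_CACHE_PATH /shared/enroot/cache
```

The other settings shown in the question can stay commented out so that they keep their defaults.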