Tags: python, parallel-processing, hpc

Specifying distribution of workers across nodes using scoop


Is there a way to specify the distribution of workers across nodes when running scoop programs on an HPC cluster?

I've only come across scoop recently, and so far it seems like an excellent tool for quickly converting code designed to run with multiprocessing on a single compute node into code that will utilise multiple nodes simultaneously.

However, is there a way to use scoop to run only one worker per compute node across a cluster, so that multi-threading at a deeper level in the code can run within each multi-core node?

I'm aware that one can specify the number of workers to initialise with the -n flag, or specify the hosts to connect to using either a hostfile or the --hosts flag (http://scoop.readthedocs.io/en/latest/usage.html#how-to-launch-scoop-programs). Is there a way to use a hostfile to do this? And if so, how can this be done on a cluster with a scheduling system (in this case torque) which would normally allocate the nodes to the program?

If this cannot be done with scoop, can it be done with other packages (MPI, Parallel Python, pathos, etc.)?


Solution

  • Just starting out with scoop myself.

    Seems you can do this by specifying the number of workers per host via the hostfile.

    The hostfile has the following syntax:

    hostname_or_ip 4
    other_hostname
    third_hostname 2
    

    where each name is a system hostname or IP address and the optional number is the number of workers to launch on that host (a sketch of building such a hostfile under Torque follows after this answer).

    See: https://scoop.readthedocs.io/en/0.7/usage.html#hostfile-format
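
To apply this on a Torque-managed cluster, one possible approach (a minimal sketch, not taken from the answer above) is to build the hostfile inside the job script from the node list Torque exposes via the PBS_NODEFILE environment variable, requesting one worker per unique node, and then launch SCOOP with --hostfile. The file name scoop_hosts.txt and the program name my_scoop_program.py below are placeholders.

```python
# Sketch: build a one-worker-per-node SCOOP hostfile from Torque's
# $PBS_NODEFILE and launch the SCOOP program with it.
import os
import subprocess

# Torque lists one line per allocated processor slot; de-duplicate to get
# the unique node names while preserving their order.
with open(os.environ["PBS_NODEFILE"]) as f:
    nodes = list(dict.fromkeys(line.strip() for line in f if line.strip()))

# Write a SCOOP hostfile requesting exactly one worker per node.
with open("scoop_hosts.txt", "w") as f:
    for node in nodes:
        f.write(f"{node} 1\n")

# Launch the SCOOP program with the generated hostfile.
# "my_scoop_program.py" is a placeholder for your own script.
subprocess.run(
    ["python", "-m", "scoop", "--hostfile", "scoop_hosts.txt", "my_scoop_program.py"],
    check=True,
)
```

The single worker per node is then free to use the node's remaining cores through whatever lower-level threading or multiprocessing the program itself sets up.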