Tags: python, partitioning, distributed-computing, slurm, hpc

Distributing Python code across nodes in SLURM


I have a computationally expensive simulation function that I would like to distribute across a multi-node cluster. The code looks something like this:

input_tasks = [input_0, input_1, ..., input_n]
for i in input_tasks:
    expensive_function(i)

I am running the code from a high-compute node and want to distribute the function inputs to many nodes with varying compute power. The highest-compute nodes should take priority and always pick up the next task when they are free. Pseudocode of what I wish to do is written below.

input_tasks = [input_0, input_1, ..., input_n]
available_nodes_ranked_by_compute = [node_0, node_1, ..., etc]
while input_tasks:
    i = input_tasks.pop(0)
    # get the best currently free node, or wait for one to free up
    node_i = available_nodes_ranked_by_compute.pop(0)
    expensive_function(i, node_i)
    # add the node back to the available list when it is done
    available_nodes_ranked_by_compute.append(node_i)
    # re-sort available nodes by compute
   

I am relatively new to distributed computing and SLURM usage, so I am unsure how to check whether a particular node is currently in use. I want a way to keep a dynamic list/heap of the currently unused nodes on the cluster so I can use it to execute all my tasks. Is there a basic way to do this?


Solution

  • There are several ways to run code distributed across multiple nodes. Since you are running simulations, you can try mpirun or pathos (see the MPI sketch after this answer). If your simulation needs a GPU, look into the PyTorch or TensorFlow APIs and replace your NumPy operations with their equivalents.

    Since you are new to this, have a read through this tutorial. Personally, I prefer high-level tools such as Submitit, or Joblib with Hydra, for machine learning workloads; a Submitit sketch is included below as well.

    You need to find the right tool for your use case.
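
To make the mpirun suggestion concrete, here is a minimal sketch using mpi4py (one common way to run Python under mpirun; the answer itself does not prescribe a specific MPI binding). It assumes `expensive_function` and `input_tasks` are the objects from the question and that the tasks are independent. Note that this splits the task list statically across ranks rather than feeding the fastest free node first; for that dynamic behaviour, letting the SLURM scheduler place one job per task (as in the Submitit sketch below) is closer to what you describe.

    # sketch: static task splitting with mpi4py (assumed binding for mpirun)
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # index of this process across all nodes
    size = comm.Get_size()   # total number of MPI processes

    input_tasks = [...]      # same task list as in the question

    # each rank processes a strided slice of the task list
    for i in range(rank, len(input_tasks), size):
        expensive_function(input_tasks[i])

You would launch this with something like `mpirun -n <procs> python script.py`, or `srun python script.py` inside a SLURM allocation.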
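
For the Submitit route, here is a minimal sketch of how the task list from the question could be submitted as a SLURM job array. The partition name, log folder, and resource values are placeholders you would replace with your cluster's settings; `expensive_function` and `input_tasks` are again the objects from the question.

    # sketch: submitting one SLURM task per input with Submitit
    import submitit

    executor = submitit.AutoExecutor(folder="submitit_logs")  # log folder (arbitrary name)
    executor.update_parameters(
        timeout_min=120,             # per-task wall-time limit (assumed value)
        slurm_partition="compute",   # assumed partition name
        cpus_per_task=4,             # assumed resources per task
    )

    # one SLURM array task per input; the SLURM scheduler places each task on a
    # free node as it becomes available
    jobs = executor.map_array(expensive_function, input_tasks)
    results = [job.result() for job in jobs]  # blocks until all tasks finish

With this approach you do not need to track free nodes yourself, since SLURM already maintains that state; to prefer the high-compute nodes, you can typically submit to a partition that contains them (or use node feature constraints if your site defines them).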