I am trying to port a MS VC++ program to run on a rocks cluster! I am not very good with linux but I am eager to learn and I imagine porting it wouldn't be an impossible task for me. However, I do not understand how to take advantage of the cluster nodes. because it seems that the code execute only runs on the front end server (obviously).
I have read a little about MPI and its seems like I should use MPI to comminicate between nodes. The program is currently written such that I have a main thread that synchronizes all worker threads. The main thread also recieves commands to manipulate the simulation or query its state. If the simulation is properly setup, communication between executing threads can be significantly minimized. What I don't understand is how do I start the process on the compute nodes and how do I handle failures in nodes? And maybe there should be other things I should also consider when porting my program to run in a cluster?
The first step is porting the threaded MS VC++ program to run on a single Linux machine.
Once you have gotten past that point, then modify your program to use MPI in addition to threads (or instead of threads). You can do this on a single computer as well.
To run the program on multiple nodes on your cluster, you will need to submit the program to whatever scheduling system you cluster uses. The command for this is dependent on the scheduling software used for your Rocks cluster. Ask your administrator. It may look something like mpirun -np 32 yourprogram
.
Handling failures is the nodes is a broad question. Your first pass should probably just report the failure, then fail the program. If the program doesn't take to long to compute on the cluster, then restarting the program, adjusting for the failed node, may be good enough. Beyond that, your application can write to disk intermediate information needed to resume where it left off. This is called checkpointing your application. Thus, when a node fails, the job fails, but restarting the job doesn't start from the beginning. Much more advanced would be trying to actually detect node failures and reschedule the work unit that was on the failed node. This assumes that the work unit doesn't have non-idempotent side effects. This sort of thing gets really complicated. Checkpointing is likely good enough.