I am trying to build a script that runs MPI jobs in batch mode at certain hours. If I run mpdallexit, mpdboot and mpirun in a console, everything works fine and the parallel jobs start on all nodes in mpd.hosts. But if I run them from a bash script (submitted with at script now +1 minute), mpd crashes and no jobs are started.
These are the relevant lines in the script:
$path_mpi/mpdallexit
$path_mpi/mpdboot -n 5 &
time $path_mpi/mpirun -n 21 ./rams60 -f RAMSIN.operatiu
$path_mpi/mpdallexit
and these are the error messages from the log:
mpiexec_ventus: cannot connect to local mpd (/tmp/mpd2.console_meteo); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
In case 1, you can start an mpd on this host with:
mpd &
and you will be able to run jobs just on this host.
For more details on starting mpds on a set of hosts, see
the MPICH2 Installation Guide.
I have tried different mpdboot options:
--loccons says you do not want a console available on local mpd(s)
--remcons says you do not want consoles available on remote mpd(s)
or
mpdboot -n 5 &
but without success.
MPICH2 is installed at /usr/local/mpich2-1.0.5p4/
EDIT 1:
After trying @shellter's advice on sleep, I still could not run the parallel jobs with either at or cron. When I issue a batch mpirun job, some processes start on the master node but none start on the other cluster nodes.
On the master node:
ps -ef | grep rams
meteo 28043 26837 0 Apr21 ? 00:00:00 time /usr/bin/mpirun -n 50 -f machinefile ./rams60 -f RAMSIN.operatiu
meteo 28044 28043 0 Apr21 ? 00:00:00 /usr/bin/mpirun -n 50 -f machinefile ./rams60 -f RAMSIN.operatiu
meteo 28050 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
meteo 28051 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
meteo 28052 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
meteo 28053 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
meteo 28054 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
meteo 28055 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
meteo 28056 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
meteo 28057 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
meteo 28058 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
meteo 28059 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
meteo 28060 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
meteo 28061 28045 0 Apr21 ? 00:00:00 ./rams60 -f RAMSIN.operatiu
Besides, rams60 creates no output files, even though the first thing it does is write its start analysis files.
Everything runs fine if I execute the script from the command line, but it seems that MPICH cannot communicate with the nodes when run in batch.
At first I installed mpich2 on the master node and NFS-exported it to the other nodes. Now I have installed mpich2 on every node.
Thanks in advance
Finally I resolved the issue with the cron job, thanks to Gilles Goullardet on the "mpich-discuss" mailing list.
The problem came from the environment in which batch jobs run. Cron uses a minimal environment, so some libraries needed by my job were not found on the cluster nodes. I had to add a line to my script that exports the library paths:
export LD_LIBRARY_PATH=/usr/local/mpich2-1.0.5p4/lib:/usr/local/hdf5/lib:$LD_LIBRARY_PATH
Now everything works and my script runs twice a day as desired. Thank you all for your help; in the process I've learned some things about cron.
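For anyone hitting the same thing, here is a minimal sketch of how the export could be written a bit more defensively. It is not what I actually run. prepend_lib_dir is a hypothetical helper of my own, not part of MPICH2. The two directories are the ones from my installation, so adjust them to yours. The only difference from the one-line export is that it avoids a trailing ':' when LD_LIBRARY_PATH starts out empty, which is likely under cron, and it prints the final value so it shows up in the batch log.

```shell
#!/bin/bash
# Sketch: build LD_LIBRARY_PATH explicitly so the job does not depend on
# whatever environment cron or at happens to provide.

# prepend_lib_dir is a hypothetical helper, not an MPICH2 command.
prepend_lib_dir() {
    # ${VAR:+...} expands only when VAR is set and non-empty, so an empty
    # starting value does not leave a trailing ':' (the dynamic loader
    # treats an empty entry as the current directory).
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Paths taken from my installation; they are assumptions for any other system.
prepend_lib_dir /usr/local/hdf5/lib
prepend_lib_dir /usr/local/mpich2-1.0.5p4/lib
export LD_LIBRARY_PATH

# Print the resulting value so the batch log records what the job saw.
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

Putting this near the top of the script, before mpdboot and mpirun, should give the batch run the same library search path that worked interactively.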