I need to submit multiple simulations to condor (multi-client execution grid) using shell and since this may take a while, I decided to write a shell script to do it for me. I am very new to shell scripting and this is the result of what I did on one day:
for H in {0..50}
do
for S in {0..10}
do
./p32 -data ../data.txt -out ../result -position $S -group $H
echo "> Ready to submit"
condor_submit profile.sub
echo "> Waiting 15 minutes for group $H Pos $S"
for W in {1..15}
do
echo "Starting minute $W"
sleep 60
done
done
echo "Deleting data_3 to free up space"
mkdir -p /tmp/data_3
if [ "$H" -lt 10 ]
then
tar cfvz /tmp/data_3/group_000$H.tar.gz ../result/data_3/group_000$H
rm -r ../result/data_3/group_000$H
else
tar cfvz /tmp/data_3/group_00$H.tar.gz ../result/data_3/group_00$H
rm -r ../result/data_3/group_00$H
fi
done
This script runs through simulation groups 0..50 and, for each group, passes positions 0..10 to a program that generates a condor submission profile. I then submit this profile and let it execute for 15 minutes (with a call being made every minute to ensure the SSH pipe doesn't break). Once all positions of a group have had their 15 minutes, I compress the group's output to a volume with more space and erase the original files.
I implemented it this way because our condor system can only handle up to 10,000 submissions at once, and one submission (condor_submit profile.sub) executes 7000+ simulations.
Now my problem is with the condor_submit line. When I checked this morning I (luckily) spotted that calling condor_submit profile.sub may cause an error if the network is too busy. The error message is:
ERROR: Failed to connect to local queue manager
CEDAR:6001:Failed to connect to <IP_NUMBER:PORT_NUMBER>
This means that from time to time a whole iteration gets lost! How can I work around this? The only way I see is to use shell to read in the last line/s of terminal output and evaluate whether they follow the expected response i.e.:
7392 job(s) submitted to cluster CLUSTER_NUMBER.
But how would I read in the last line and go about checking for errors?
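A minimal sketch of such a check, assuming the success message always has the form shown above and that CLUSTER_NUMBER is purely numeric. The function name submission_succeeded is hypothetical, not part of condor:

```shell
#!/usr/bin/env bash
# submission_succeeded OUTPUT   (hypothetical helper, not a condor command)
# Returns 0 if the last non-empty line of OUTPUT looks like
# "<N> job(s) submitted to cluster <ID>." and 1 otherwise.
# Assumption: the cluster ID is a plain number.
submission_succeeded() {
    local last_line
    # Drop blank lines, then keep only the final line of the captured text.
    last_line=$(printf '%s\n' "$1" | grep -v '^[[:space:]]*$' | tail -n 1)
    # -E: extended regex, -q: no output, only the exit status matters.
    printf '%s\n' "$last_line" |
        grep -Eq '^[0-9]+ job\(s\) submitted to cluster [0-9]+\.?[[:space:]]*$'
}

# Intended use inside the loop (captures stdout and stderr together):
#   output=$(condor_submit profile.sub 2>&1)
#   if ! submission_succeeded "$output"; then
#       echo "> Submission failed for group $H Pos $S" >&2
#   fi
```

Capturing the output with $(...) means the text no longer appears on the terminal unless it is echoed again afterwards.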
Any help is much needed and very much appreciated.
Does condor_submit give a non-zero exit code when it fails? If so, you can try calling it like this:
while ! condor_submit profile.sub; do
sleep 5
done
which will cause the current profile to be submitted every 5 seconds until it succeeds.
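Note that the loop above never gives up, so if the queue manager stays unreachable the script will hang at that point. A bounded variant could look like the sketch below. The helper name retry_cmd and the attempt and delay values are illustrative assumptions, and it still relies on condor_submit returning a non-zero exit code on failure:

```shell
#!/usr/bin/env bash
# retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND until it exits with status 0 or MAX_ATTEMPTS is reached.
# Returns 0 on success and 1 if every attempt failed.
retry_cmd() {
    local max_attempts=$1 delay=$2 attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "> Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "> Attempt $attempt failed, retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Intended use in place of the plain condor_submit call:
#   retry_cmd 20 5 condor_submit profile.sub || exit 1
```

Stopping the whole script with exit 1 when all attempts fail is one possible choice; logging the failed group and position and continuing with the next one would also work.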