
Use shell output for error handling for condor


I need to submit multiple simulations to condor (a multi-client execution grid) from the shell, and since this may take a while, I decided to write a shell script to do it for me. I am very new to shell scripting, and this is what I came up with in one day:

for H in {0..50}
do
    for S in {0..10}
    do
        ./p32 -data ../data.txt -out ../result -position $S -group $H
        echo "> Ready to submit"
        condor_submit profile.sub
        echo "> Waiting 15 minutes for group $H Pos $S"
        for W in {1..15}
        do
            echo "Starting minute $W"
            sleep 60
        done
    done

    echo "Deleting data_3 to free up space"
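    mkdir -p /tmp/data_3    # -p: no error if the directory already exists
    if [ "$H" -lt 10 ]      # numeric comparison; [ needs spaces around its arguments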
        then
            tar cfvz /tmp/data_3/group_000$H.tar.gz ../result/data_3/group_000$H
            rm -r ../result/data_3/group_000$H
        else
            tar cfvz /tmp/data_3/group_00$H.tar.gz ../result/data_3/group_00$H
            rm -r ../result/data_3/group_00$H
    fi
done

This script loops over groups 0..50 and, for each group, passes positions 0..10 to a program that generates a condor submission profile. It then submits that profile and lets it execute for 15 minutes (printing a message every minute so that the SSH connection doesn't drop). Once all positions for a group are done, I compress the output to a volume with more space and erase the original files.

The reason I implemented this is that our condor system can only handle up to 10,000 submissions at once, and one submission (condor_submit profile.sub) executes 7000+ simulations.

Now my problem is with the condor_submit line. When I checked this morning I (luckily) spotted that calling condor_submit profile.sub may fail if the network is too busy. The error message is:

ERROR: Failed to connect to local queue manager
CEDAR:6001:Failed to connect to <IP_NUMBER:PORT_NUMBER>

This means that from time to time a whole iteration gets lost! How can I work around this? The only way I can see is to have the shell read the last line(s) of terminal output and check whether they match the expected response, i.e.:

7392 job(s) submitted to cluster CLUSTER_NUMBER.

But how would I read in the last line and go about checking for errors?

Any help is badly needed and very much appreciated.


Solution

  • Does condor_submit give a non-zero exit code when it fails? If so, you can try calling it like this:

    while ! condor_submit profile.sub; do
        sleep 5
    done
    

    which will cause the current profile to be resubmitted every 5 seconds until it succeeds.
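
If the exit code turns out not to be reliable, the output check proposed in the question could be sketched roughly as follows. This is only a sketch under assumptions: `submit_with_retry` is a hypothetical helper name, the try limit and delay are arbitrary, and success is judged purely by the "job(s) submitted to cluster" text quoted in the question.

```shell
#!/bin/sh
# submit_with_retry MAX_TRIES DELAY COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, captures its output, and retries until
# the output contains the confirmation text or MAX_TRIES is exhausted.
submit_with_retry() {
    swr_max=$1
    swr_delay=$2
    shift 2
    swr_try=1
    while [ "$swr_try" -le "$swr_max" ]; do
        # Capture stdout and stderr together; the exit status is ignored
        # here because the output text decides success in this sketch.
        swr_out=$("$@" 2>&1) || true
        # Accept the attempt only if the expected confirmation line appears.
        if printf '%s\n' "$swr_out" | grep -qF 'job(s) submitted to cluster'; then
            printf '%s\n' "$swr_out"
            return 0
        fi
        echo "> Attempt $swr_try failed: $swr_out" >&2
        swr_try=$((swr_try + 1))
        sleep "$swr_delay"
    done
    # All attempts failed; let the caller decide what to do.
    return 1
}
```

In the loop from the question, this might be called as `submit_with_retry 10 30 condor_submit profile.sub || echo "> Giving up on group $H Pos $S"`. Unlike the unbounded `while` loop above, it gives up after a fixed number of tries, so a permanently unreachable queue manager does not stall the whole run.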