My university runs a condor computing grid (compute nodes are running Linux), and I'd like to use it for running simulations in R. The problem is that only some of the machines on the grid have R installed. So far I see two options, but I don't know how to implement either one, so I hope you'll help me (keeping in mind that I'm not a sysadmin and can't do much to change the setup of the compute nodes):
1) Put a check in the ClassAds that go out with my condor submit file to require that the job be computed on nodes that have a /usr/bin/R.
2) Package R and all of its dependencies into a self-contained directory that can be sent out to the compute nodes and against which my simulation can be run. I've tried for several hours to do this, but the Linux version of R (unlike the OSX and Windows versions) seems to run against libraries that are distributed across the filesystem, and I can't think of a practical way to gather them all into a location where R can find them.
Any ideas? Thanks in advance.
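What eventually worked for me was a variant of proposed solution (1): instead of a ClassAd check, the worker script tests for R on the compute node, and the job is re-queued if the test fails. Here I discuss how I implemented this in my condor submit file and my worker shell script.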
Here's the shell script. The important change was to check whether R is installed on the compute node via: if [ -f /usr/bin/R ]. If R is found, we go down a path that ends in a return value of 0. If R is not found, we return 1 (that's the meaning of the lines exit 0 and exit 1).
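#!/bin/sh
mkdir output
# Only run the simulation if R is installed on this compute node.
if [ -f /usr/bin/R ]
then
    # Choose the script matching the node's architecture.
    # (Test grep's exit status directly; do not execute its output.)
    if uname -m | grep -q '64'
    then
        Rscript code/simulations-x86_64.r "$@"
    else
        Rscript code/simulations-i386.r "$@"
    fi
    # Bundle the CSV results so they are transferred back with the output directory.
    tar -zcvf "output/output-$1-$2.tgz2" output/*.csv
    exit 0
else
    # No R on this node: exit non-zero so the job goes back into the queue.
    exit 1
fi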
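The check above hard-codes /usr/bin/R. In case it helps anyone whose nodes install R somewhere else, here is a minimal sketch of a more general test. It assumes Rscript is on the PATH of the nodes that have R, which I have not verified for any particular grid. The helper names have_cmd and pick_script are made up for illustration.

```shell
# have_cmd (hypothetical helper): succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# pick_script (hypothetical helper): prints the simulation script for a
# machine string such as the output of `uname -m`. Like the grep test in
# the worker script, it treats any string containing "64" as 64-bit.
pick_script() {
    case "$1" in
        *64*) echo "code/simulations-x86_64.r" ;;
        *)    echo "code/simulations-i386.r" ;;
    esac
}
```

In the worker script, the outer test would then become `if have_cmd Rscript`, and the inner branch could be replaced by `Rscript "$(pick_script "$(uname -m)")" "$@"`.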
Now the condor submit file. The crucial change was the second-to-last line (on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)). It checks the return value of each job from the compute node: if the return value is not zero (i.e. if R wasn't found on the compute node), then the job is put back into the queue to be re-run. Otherwise, the job is considered finished and is removed from the queue.
universe = vanilla
log = logs/log_$(Cluster)_$(Process).log
error = logs/err_$(Cluster)_$(Process).err
output = logs/out_$(Cluster)_$(Process).out
executable = condor/worker.sh
arguments = $(Cluster) $(Process)
requirements = (Target.OpSys=="LINUX" && regexp("stat", Machine))
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = code, R-libs, condor, seeds.csv
transfer_output_files = output
notification = Never
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
queue 1800
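To make the on_exit_remove rule concrete, here is a small sketch that mirrors the expression in plain shell and prints what happens to a job for a few outcomes. It is only an illustration of the logic, not something Condor runs. The function name should_remove is made up, and the exit code passed in the signal case is just a placeholder.

```shell
# should_remove (hypothetical helper) mirrors the submit-file expression
#   (ExitBySignal == False) && (ExitCode == 0)
# $1 is "true" or "false" (was the job killed by a signal?), $2 is the exit code.
should_remove() {
    [ "$1" = "false" ] && [ "$2" -eq 0 ]
}

# Walk through three outcomes: R found, R missing, job killed by a signal.
for outcome in "false 0" "false 1" "true 0"; do
    set -- $outcome
    if should_remove "$1" "$2"; then
        echo "ExitBySignal=$1 ExitCode=$2: job leaves the queue"
    else
        echo "ExitBySignal=$1 ExitCode=$2: job stays in the queue to be re-run"
    fi
done
```

Only the first outcome (a normal exit with status 0, i.e. the worker found R and reached exit 0) takes the job out of the queue.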