Sandboxing R for Condor (on Linux)


My university runs a Condor computing grid (the compute nodes run Linux), and I'd like to use it to run simulations in R. The problem is that only some of the machines on the grid have R installed. So far I see two options, but I don't know how to implement either one, so I hope you can help (keeping in mind that I'm not a sysadmin and can't do much to change the setup of the compute nodes):

1) Put a check in the ClassAds that go out with my condor submit file to require that the job be computed on nodes that have a /usr/bin/R.

2) Package R and all of its dependencies into a self-contained directory that can be sent out to the compute nodes and against which my simulation can be run. I've tried for several hours to do this, but the Linux version of R (unlike the OSX and Windows versions) seems to run against libraries that are distributed across the filesystem, and I can't think of a practical way to gather them all into a location where R can find them.

Any ideas? Thanks in advance.


Solution

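  • What eventually worked for me was a variant of proposed solution (1): rather than a ClassAd requirement, the worker script checks for R at run time, and the job is requeued if R is missing. Here I describe how I implemented this in my condor submit file and my worker shell script.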

    Here's the shell script. The important change is the check for R on the compute node: if [ -f /usr/bin/R ]. If R is found, the script runs the simulation, archives the results, and exits with status 0 (exit 0). If R is not found, it exits with status 1 (exit 1).

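    # Create the directory that Condor transfers back (transfer_output_files).
    mkdir -p output
    # Only run the simulation if R is installed on this compute node.
    if [ -f /usr/bin/R ]
    then
        # Pick the script that matches the node's architecture.
        # (The original `if $(uname -m |grep '64')` tried to execute grep's
        # output as a command, so the 64-bit branch was never taken.)
        if uname -m | grep -q '64'
        then
            Rscript code/simulations-x86_64.r "$@"
        else
            Rscript code/simulations-i386.r "$@"
        fi

        # Bundle the results; $1 and $2 are the cluster and process IDs.
        tar -zcvf "output/output-$1-$2.tgz2" output/*.csv
        exit 0    # success: Condor removes the job from the queue
    else
        exit 1    # no R on this node: Condor puts the job back in the queue
    fi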
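
    The test for /usr/bin/R misses nodes where R is installed somewhere else. A minimal sketch of a more general check follows. It assumes that Rscript is on the job's PATH wherever R is usable, which may not hold on every pool, and have_cmd is a hypothetical helper name that is not part of the original script:

    ```shell
    #!/bin/sh
    # Hypothetical helper: succeeds (status 0) if the named command is on the PATH.
    have_cmd() {
        command -v "$1" >/dev/null 2>&1
    }

    # Report what the worker script would see on this node.
    if have_cmd Rscript
    then
        echo "Rscript found"
    else
        echo "Rscript missing"
    fi
    ```

    In the worker script, if have_cmd Rscript could stand in for if [ -f /usr/bin/R ], with the exit 0 and exit 1 branches left unchanged.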
    

    Now the condor submit file. The crucial change is the second-to-last line (on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)). It checks how each job ended on the compute node: if the exit code is not zero (i.e. R wasn't found on the compute node) or the job was killed by a signal, the job is put back into the queue to be rerun. Otherwise, the job is considered finished and is removed from the queue.

    universe = vanilla
    log = logs/log_$(Cluster)_$(Process).log
    error = logs/err_$(Cluster)_$(Process).err
    output = logs/out_$(Cluster)_$(Process).out
    executable = condor/worker.sh
    arguments = $(Cluster) $(Process)
    requirements = (Target.OpSys=="LINUX" && regexp("stat", Machine))
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    transfer_input_files = code, R-libs, condor, seeds.csv
    transfer_output_files = output
    notification = Never
    on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
    queue 1800