
SGE array jobs and R


I currently have an R script written to perform a population genetic simulation, then write a table with my results to a text file. I would like to somehow run multiple instances of this script in parallel using an array job (my University's cluster uses SGE), and when it's all done I will have generated results files corresponding to each job (Results_1.txt, Results_2.txt, etc.).

I've spent the better part of the afternoon reading and trying to figure out how to do this, but haven't really found anything along the lines of what I am trying to do. I was wondering if someone could provide an example or perhaps point me in the direction of something I could read to help with this.


Solution

  • To boil down mithrado's answer to the bare essentials:

    Create a job script, pop_gen.bash, that passes the SGE task ID to the R script as an argument and stores the results in a file identified by that same task ID:

    #!/bin/bash
    Rscript pop_gen.R ${SGE_TASK_ID} > Results_${SGE_TASK_ID}.txt
    

    Submit this script as a job array, e.g. 1000 jobs:

    qsub -t 1-1000 pop_gen.bash
    
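    As an aside, the same options can be embedded in the job script itself as `#$` directives, so the array range need not be repeated on every qsub invocation. A sketch of this variant (the -cwd and -j settings are optional choices here, not requirements):

```shell
#!/bin/bash
# Sketch of pop_gen.bash with SGE options embedded as #$ directives;
# a plain "qsub pop_gen.bash" then picks them up automatically.
#$ -t 1-1000   # array job: tasks 1..1000
#$ -cwd        # run from the submission directory
#$ -j y        # merge stderr into stdout
Rscript pop_gen.R ${SGE_TASK_ID} > Results_${SGE_TASK_ID}.txt
```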

    Grid Engine will execute pop_gen.bash 1000 times, each time setting SGE_TASK_ID to a value ranging from 1 to 1000.

    Additionally, as mentioned above, because SGE_TASK_ID is passed to pop_gen.R as a command-line argument, you can use it inside the script to name the output file:

    args <- commandArgs(trailingOnly = TRUE)
    out.file <- paste("Results_", args[1], ".txt", sep = "")
    d <- data.frame(x = 1:10)  # placeholder for your simulation results
    write.table(d, file = out.file)
    
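    A common extension is to map each task ID onto a point in a parameter grid inside the job script, then pass those values on to Rscript. A minimal bash sketch (the parameter arrays are made-up examples, and SGE_TASK_ID is set by hand here to mimic what Grid Engine would export):

```shell
#!/bin/bash
# Simulate what SGE would export for task 5 of the array job.
SGE_TASK_ID=5
# Hypothetical parameter grid: 3 population sizes x 2 mutation rates.
POP_SIZES=(100 500 1000)
MUT_RATES=(1e-8 1e-7)
idx=$((SGE_TASK_ID - 1))      # convert 1-based task ID to 0-based index
pop=${POP_SIZES[idx % 3]}     # cycles through population sizes
mut=${MUT_RATES[idx / 3]}     # advances after each full cycle of sizes
echo "task ${SGE_TASK_ID}: N=${pop} mu=${mut}"
```

    These values could then be forwarded with e.g. `Rscript pop_gen.R ${SGE_TASK_ID} ${pop} ${mut}` and read from `commandArgs` as before.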

    HTH