Tags: r, oozie, oozie-coordinator

How to Schedule R Scripts Using Oozie


I am using RHadoop on the Hortonworks Sandbox to read data from HDFS into R, and after reading a file into R, I perform certain operations on it.

I want to schedule (daily, weekly, monthly) this R script using Oozie.

Any help is strongly appreciated.

Thanks


Solution

  • It seems somebody has already done this for you:

    Here are the relevant bash script and usage instructions from the Oozie R helper on GitHub.

    #!/bin/bash

    # Print an error to stderr and abort.
    die () {
        echo >&2 "$@"
        exit 1
    }

    [ "$#" -eq 3 ] || die "3 arguments required, $# provided"
    hdfs_file=$1
    r_file=$2
    hdfs_output=$3

    # Only outputs under /tmp/ are allowed (the action runs as the mapred user).
    if [[ ${hdfs_output} =~ ^\/tmp\/.*$ ]]; then
        echo "I will run the R script ${r_file} on the HDFS input ${hdfs_file}"
        tmp_filename="/tmp/$(date +"%Y%m%d.%H%M%S")"
        echo "using tmp file ${tmp_filename}"
        tmp_output="/tmp/out$(date +"%Y%m%d.%H%M%S")"

        # Merge the HDFS input into one local file, run the R script on it,
        # then replace the HDFS output with the result. (-rm -r replaces the
        # deprecated -rmr.)
        hadoop fs -getmerge "${hdfs_file}" "${tmp_filename}"
        R -f "${r_file}" --args "${tmp_filename}" "${tmp_output}"
        hadoop fs -rm -r "${hdfs_output}"
        hadoop fs -put "${tmp_output}" "${hdfs_output}"
    else
        die "${hdfs_output} must be in /tmp/"
    fi
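
    The workflow example below ships a script named count.r, so for completeness, here is a minimal sketch of what such a script could look like. This file is not part of the original answer, and the line-counting logic is purely illustrative; what the wrapper guarantees is that the merged local input file and a local output path are passed via `--args`:

    ```r
    # Hypothetical count.r: count the lines of the input file and write
    # the result to the output file. Both paths arrive via --args.
    args <- commandArgs(trailingOnly = TRUE)
    input_file  <- args[1]
    output_file <- args[2]

    writeLines(as.character(length(readLines(input_file))), output_file)
    ```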
    

    Oozie R helper

    The data science team wanted to be able to run R scripts using Oozie.

    They wanted to run an ETL in Hive, and then run an R script on the result of that ETL.

    So I created a bash script that takes 3 arguments:

    1. The HDFS input path of the files they want to process
    2. The R script they want to run
    3. The HDFS output path where they want the result to go (currently only /tmp/ is allowed, because the user is mapred)

    How to run

    You can use a shell oozie action like this:

    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>run_r_hadoop.sh</exec>
        <argument>/user/hive/warehouse/dual</argument>
        <argument>count.r</argument>
        <argument>/tmp/r_test</argument>
        <file>count.r#count.r</file>
    </shell>
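
    To get the daily, weekly, or monthly schedule the question asks about, put this shell action inside a workflow.xml and run that workflow from an Oozie coordinator. A minimal sketch, where the app name, start/end dates, and the ${nameNode}/user/${user}/r-wf path are placeholders to adapt; change the frequency to ${coord:days(7)} or ${coord:months(1)} for weekly or monthly runs:

    ```xml
    <coordinator-app name="r-script-coord" frequency="${coord:days(1)}"
                     start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
        <action>
            <workflow>
                <app-path>${nameNode}/user/${user}/r-wf</app-path>
            </workflow>
        </action>
    </coordinator-app>
    ```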
    

    Prerequisite

    R should be installed on all Hadoop slave nodes, including every library that the script uses.