Search code examples
pythonrazureazure-batch

Cannot load R packages on Azure Batch nodes


I am having difficulty loading packages into R on my compute pool nodes using the Azure Batch Python API. The code that I am using is similar to what is provided in the Azure Batch Python SDK Tutorial, except the task is more complicated -- I want each node in the job pool to execute an R script which requires certain package dependencies.

Hence, in my start task commands below, I have each node (Canonical UbuntuServer SKU: 16) install R via apt and install R package dependencies (the reason why I added R package installation to the start task is that, even after creating a lib directory ~/Rpkgs with universal permissions, running install.packages(list_of_packages, lib="~/Rpkgs/", repos="http://cran.r-project.org") in the task script leads to "not writable" errors.)

task_commands = [
    'cp -p {} $AZ_BATCH_NODE_SHARED_DIR'.format(_R_TASK_SCRIPT),
    # Install pip
    'curl -fSsL https://bootstrap.pypa.io/get-pip.py | python',
    # Install the azure-storage module so that the task script can access Azure Blob storage, pre-cryptography version
    'pip install azure-storage==0.32.0',
    # Install R
    'sudo apt -y install r-base-core',
    'mkdir ~/Rpkgs/',
    'sudo chown _azbatch:_azbatchgrp ~/Rpkgs/',
    'sudo chmod 777 ~/Rpkgs/',
    # Install R package dependencies
    # *NOTE*: the double escape below is necessary because Azure strips the forward slash
    'printf "install.packages( c(\\"foreach\\", \\"iterators\\", \\"optparse\\", \\"glmnet\\", \\"doMC\\"), lib=\\"~/Rpkgs/\\", repos=\\"https://cran.cnr.berkeley.edu\\")\n" > ~/startTask.txt',
    'R < startTask.txt --no-save'
    ]

Anyhow, I confirmed in the Azure portal that these packages installed as intended on the compute pool nodes (you can see them located at startup/wd/Rpkgs/, a.k.a. ~/Rpkgs/, in the node filesystem). However, while the _R_TASK_SCRIPT task was successfully added to the job pool, it terminated with a non-zero exit code because it wasn't able to load any of the packages (e.g. foreach, iterators, optparse, etc.) that had been installed in the start task.

More specifically, the _R_TASK_SCRIPT contained the following R code and returned the following output:

R code:

lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require, character.only=TRUE, lib.loc="~/Rpkgs/")
...

R stderr, stderr.txt on Azure Batch node:

Loading required package: iterators
Loading required package: foreach
Loading required package: optparse
Loading required package: glmnet
Loading required package: doMC

R stdout, stdout.txt on Azure Batch node:

[[1]]
[1] FALSE

[[2]]
[1] FALSE

[[3]]
[1] FALSE

[[4]]
[1] FALSE

[[5]]
[1] FALSE

FALSE above indicates that it was not able to load the R package. This is the issue I'm facing, and I'd like to figure out why.

It may be noteworthy that, when I spin up a comparable VM (Canonical UbuntuServer SKU: 16) and run the same installation manually, it successfully loads all packages.

myusername@rnode:~$ pwd
/home/myusername
myusername@rnode:~$ mkdir ~/Rpkgs/
myusername@rnode:~$ printf "install.packages( c(\"foreach\", \"iterators\", \"optparse\", \"glmnet\", \"doMC\"), lib=\"~/Rpkgs/\", repos=\"http://cran.r-project.org\")\n" > ~/startTask.txt
myusername@rnode:~$ R < startTask.txt --no-save
myusername@rnode:~$ R

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
...
> lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require, character.only=TRUE, lib.loc="~/Rpkgs/")
Loading required package: iterators
Loading required package: foreach
...
Loading required package: optparse
Loading required package: glmnet
Loading required package: Matrix
Loaded glmnet 2.0-10

Loading required package: doMC
Loading required package: parallel
[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

[[4]]
[1] TRUE

[[5]]
[1] TRUE

Thanks in advance for your help and suggestions.


Solution

  • Each task runs on its own working directory which is referenced by the environment variable, $AZ_BATCH_TASK_WORKING_DIR. When the R session runs, the current R working directory [ getwd() ] will be $AZ_BATCH_TASK_WORKING_DIR, not $AZ_BATCH_NODE_STARTUP_DIR where the pkgs lives.

    To get the exact package location ("startup/wd/pkgs") in the R code,

    lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require, 
    character.only=TRUE, lib.loc=paste0(Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"), 
     "/wd/", "Rpkgs") )
    

    or

    Run this method before the lapply:

    setwd(paste0(Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"), "/wd/"))
    

    Added: You can also create a Batch pool of Azure data scientist virtual machines that has R already installed so you don't have to install it yourself.

    Azure Batch has the doAzureParallel R package supports package installation. Here's a link: https://github.com/Azure/doAzureParallel (Disclaimer: I created the doAzureParallel R package)