I am having difficulty loading packages into R on my compute pool nodes using the Azure Batch Python API. The code that I am using is similar to what is provided in the Azure Batch Python SDK Tutorial, except the task is more complicated -- I want each node in the job pool to execute an R script which requires certain package dependencies.
Hence, in my start task commands below, I have each node (Canonical UbuntuServer SKU: 16) install R via apt and install the R package dependencies. (I added the R package installation to the start task because, even after creating a lib directory ~/Rpkgs with universal permissions, running install.packages(list_of_packages, lib="~/Rpkgs/", repos="http://cran.r-project.org") in the task script leads to "not writable" errors.)
task_commands = [
# Install pip
'curl -fSsL https://bootstrap.pypa.io/get-pip.py | python',
# Install the azure-storage module so that the task script can access Azure Blob storage, pre-cryptography version
'pip install azure-storage==0.32.0',
# Install R
'sudo apt -y install r-base-core',
'mkdir ~/Rpkgs/',
'sudo chown _azbatch:_azbatchgrp ~/Rpkgs/',
'sudo chmod 777 ~/Rpkgs/',
# Install R package dependencies
# *NOTE*: the doubled backslashes below are necessary so that an escaped quote (\") survives Python string parsing and reaches the shell
'printf "install.packages( c(\\"foreach\\", \\"iterators\\", \\"optparse\\", \\"glmnet\\", \\"doMC\\"), lib=\\"~/Rpkgs/\\", repos=\\"https://cran.cnr.berkeley.edu\\")\n" > ~/startTask.txt',
'R < startTask.txt --no-save'
]
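To check what the node's shell actually receives from the printf entry, it can help to inspect the string locally before submitting the pool. The following is a minimal sketch, not part of the original code: the command is shortened to one package, and shlex is used only as an approximation of POSIX shell quoting (it does not perform tilde expansion or the redirection).

```python
import shlex

# One start task entry, abbreviated to a single package but written with the
# same escaping as in the list above.
cmd = 'printf "install.packages( c(\\"foreach\\"), lib=\\"~/Rpkgs/\\")\n" > ~/startTask.txt'

# Step 1: Python has already consumed one level of escaping, so the string
# holds \" (backslash, quote) and a real newline character.
print(repr(cmd))

# Step 2: approximate POSIX shell word splitting. Inside double quotes the
# shell turns \" into a plain quote, so printf receives valid R code.
r_code = shlex.split(cmd)[1]
print(r_code)
```

If the second print shows plain double quotes around the package name and the lib path, the escaping in the Python source is doing its job.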
Anyhow, I confirmed in the Azure portal that these packages installed as intended on the compute pool nodes (you can see them located at startup/wd/Rpkgs/, a.k.a. ~/Rpkgs/, in the node filesystem). However, while the _R_TASK_SCRIPT task was successfully added to the job pool, it terminated with a non-zero exit code because it wasn't able to load any of the packages (e.g. foreach, iterators, optparse, etc.) that had been installed in the start task.
More specifically, the _R_TASK_SCRIPT contained the following R code and returned the following output:
R code:
lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require, character.only=TRUE, lib.loc="~/Rpkgs/")
R stderr (stderr.txt) on Azure Batch node:
Loading required package: iterators
Loading required package: foreach
Loading required package: optparse
Loading required package: glmnet
Loading required package: doMC
R stdout (stdout.txt) on Azure Batch node:
The output above indicates that R was not able to load the packages. This is the issue I'm facing, and I'd like to figure out why.
It may be noteworthy that, when I spin up a comparable VM (Canonical UbuntuServer SKU: 16) and run the same installation manually, it successfully loads all packages.
myusername@rnode:~$ pwd
myusername@rnode:~$ mkdir ~/Rpkgs/
myusername@rnode:~$ printf "install.packages( c(\"foreach\", \"iterators\", \"optparse\", \"glmnet\", \"doMC\"), lib=\"~/Rpkgs/\", repos=\"http://cran.r-project.org\")\n" > ~/startTask.txt
myusername@rnode:~$ R < startTask.txt --no-save
myusername@rnode:~$ R
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require, character.only=TRUE, lib.loc="~/Rpkgs/")
Loading required package: iterators
Loading required package: foreach
Loading required package: optparse
Loading required package: glmnet
Loading required package: Matrix
Loaded glmnet 2.0-10
Loading required package: doMC
Loading required package: parallel
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
Thanks in advance for your help and suggestions.
Each task runs in its own working directory, which is referenced by the environment variable $AZ_BATCH_TASK_WORKING_DIR. When the R session runs, the current R working directory [ getwd() ] will be $AZ_BATCH_TASK_WORKING_DIR, not the start task's working directory (startup/wd) where the packages live.
To get the exact package location ("startup/wd/Rpkgs") in the R code, build the path from the AZ_BATCH_NODE_STARTUP_DIR environment variable:
lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require,
character.only=TRUE, lib.loc=paste0(Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"),
"/wd/", "Rpkgs") )
Run this before the lapply call:
setwd(paste0(Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"), "/wd/"))
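If a Python wrapper script runs on the node (for example, to log the library location or hand it to the R script), the same path can be assembled there. This is a minimal sketch under the assumption that the start task created Rpkgs in its working directory, as in the question; the function name r_library_path and the example directory value are hypothetical.

```python
import os
import posixpath


def r_library_path(env=os.environ, lib_dir="Rpkgs"):
    """Mirror paste0(Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"), "/wd/", "Rpkgs")."""
    # Raises KeyError if the variable is missing, e.g. when run off-node.
    startup_dir = env["AZ_BATCH_NODE_STARTUP_DIR"]
    # Nodes in the question run Ubuntu, so join with forward slashes.
    return posixpath.join(startup_dir, "wd", lib_dir)


# Hypothetical value for illustration only; on a node, Batch sets the variable.
example_env = {"AZ_BATCH_NODE_STARTUP_DIR": "/example/startup"}
print(r_library_path(example_env))
```

Passing the environment as a parameter keeps the helper easy to try outside a Batch node, where AZ_BATCH_NODE_STARTUP_DIR is not defined.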
Added: You can also create a Batch pool of Azure Data Science Virtual Machines, which have R already installed, so you don't have to install it yourself.
Azure Batch also has the doAzureParallel R package, which supports package installation. Here's a link: https://github.com/Azure/doAzureParallel (Disclaimer: I created the doAzureParallel R package)