apache-spark emr amazon-emr apache-zeppelin

Configure Zeppelin's Spark Interpreter on EMR when starting a cluster

I am creating clusters on EMR and configure Zeppelin to read the notebooks from S3. To do that I am using a json object that looks like that:

[
  {
    "Classification": "zeppelin-env",
    "Properties": {

    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
        "ZEPPELIN_NOTEBOOK_STORAGE":"org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET":"hs-zeppelin-notebooks",
          "ZEPPELIN_NOTEBOOK_USER":"user"
        },
        "Configurations": [

        ]
      }
    ]
  }
]

I am pasting this object in the Stoftware configuration page of EMR: My question is, how/where I can configure the Spark interpreter directly without the need to manually configure it from Zeppelin each time I start a cluster?

Solution

This is a bit involved, you will need to do 2 things:

Edit the interpreter.json of Zeppelin
Restart the interpreter

So what you need to do is write a shell script and then add an extra step to the EMR cluster configuration that runs this shell script.

The Zeppelin configuration is in json, you can use jq (a tool) to manipulate json. I don't know what you want to change exactly, but here is an example that adds the (mysteriously missing) DepInterpreter:

#!/bin/bash

# 1 edit the Spark interpreter
set -e
cat /etc/zeppelin/conf/interpreter.json | jq '.interpreterSettings."2ANGGHHMQ".interpreterGroup |= .+ [{"class":"org.apache.zeppelin.spark.DepInterpreter", "name":"dep"}]' | sudo -u zeppelin tee /etc/zeppelin/conf/interpreter.json


# Trigger restart of Spark interpreter
curl -X PUT http://localhost:8890/api/interpreter/setting/restart/2ANGGHHMQ

Put this shell script in a s3 bucket. Then start your EMR cluster with

--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://mybucket/script.sh]