apache-spark emr amazon-emr apache-zeppelin

Configure Zeppelin's Spark Interpreter on EMR when starting a cluster


I am creating clusters on EMR and configuring Zeppelin to read its notebooks from S3. To do that, I use a JSON object that looks like this:

[
  {
    "Classification": "zeppelin-env",
    "Properties": {

    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE":"org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET":"hs-zeppelin-notebooks",
          "ZEPPELIN_NOTEBOOK_USER":"user"
        },
        "Configurations": [

        ]
      }
    ]
  }
]

I paste this object into the Software configuration page of EMR. My question is: how and where can I configure the Spark interpreter directly, without having to configure it manually from Zeppelin each time I start a cluster?


Solution

  • This is a bit involved; you will need to do two things:

    1. Edit the interpreter.json of Zeppelin
    2. Restart the interpreter

    So write a shell script that does both, then add an extra step to the EMR cluster configuration that runs this script.

    The Zeppelin configuration is stored as JSON, so you can use jq (a command-line JSON processor) to manipulate it. I don't know exactly what you want to change, but here is an example that adds the (mysteriously missing) DepInterpreter:

    #!/bin/bash
    set -e

    # 1. Edit the Spark interpreter: append DepInterpreter to its interpreter group.
    # Write to a temporary file first; piping jq straight back into the file it is reading can truncate it.
    conf=/etc/zeppelin/conf/interpreter.json
    tmp=$(mktemp)
    jq '.interpreterSettings."2ANGGHHMQ".interpreterGroup += [{"class":"org.apache.zeppelin.spark.DepInterpreter", "name":"dep"}]' "$conf" > "$tmp"
    sudo -u zeppelin tee "$conf" < "$tmp" > /dev/null
    rm -f "$tmp"

    # 2. Trigger a restart of the Spark interpreter
    curl -X PUT http://localhost:8890/api/interpreter/setting/restart/2ANGGHHMQ
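
If what you want to change is a Spark property rather than the interpreter group, the same jq approach should work. The following is only a sketch: it runs against a small stand-in file, not the real /etc/zeppelin/conf/interpreter.json, and the property name (spark.executor.memory), its value, and the flat string layout under "properties" are assumptions for illustration. Some Zeppelin versions store each property as a nested object, so inspect your own interpreter.json before adapting the filter. It also looks up the setting ID by name, in case it is not 2ANGGHHMQ on your cluster.

```shell
#!/bin/bash
set -euo pipefail

# Stand-in for /etc/zeppelin/conf/interpreter.json (assumed, simplified layout).
workdir=$(mktemp -d)
conf="$workdir/interpreter.json"
cat > "$conf" <<'EOF'
{
  "interpreterSettings": {
    "2ANGGHHMQ": {
      "id": "2ANGGHHMQ",
      "name": "spark",
      "properties": { "spark.executor.memory": "" }
    },
    "2AM1YV5CU": {
      "id": "2AM1YV5CU",
      "name": "md",
      "properties": {}
    }
  }
}
EOF

# Look up the setting ID by interpreter name instead of hard-coding it.
spark_id=$(jq -r '.interpreterSettings | to_entries[] | select(.value.name == "spark") | .key' "$conf")

# Set the (assumed) property; write to a temp file, then replace the original.
jq --arg id "$spark_id" \
  '.interpreterSettings[$id].properties["spark.executor.memory"] = "4g"' \
  "$conf" > "$conf.tmp"
mv "$conf.tmp" "$conf"

# Read the value back to confirm the edit landed.
executor_memory=$(jq -r --arg id "$spark_id" \
  '.interpreterSettings[$id].properties["spark.executor.memory"]' "$conf")
echo "spark interpreter id: $spark_id"
echo "spark.executor.memory: $executor_memory"
```

On a real cluster you would point conf at the actual file, write it back as the zeppelin user, and then restart the interpreter with the curl call above, using the ID found in spark_id.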
    

    Put this shell script in an S3 bucket. Then start your EMR cluster with

    --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://mybucket/script.sh]
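
For context, here is roughly how that --steps argument might fit into a full aws emr create-cluster call. This is a dry-run sketch that only prints the command. The cluster name, release label, instance type and count, and the local file name zeppelin-config.json (the JSON object from the question, saved to disk) are all placeholders, not values from the original setup.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical values; replace with your own.
script_uri="s3://mybucket/script.sh"
config_file="zeppelin-config.json"   # the JSON object from the question, saved locally
release_label="emr-5.36.0"           # assumed release; use the one you normally launch

# Quote the step so the shell does not try to glob the [ ] in Args.
step="Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[$script_uri]"

cmd=(aws emr create-cluster
  --name "zeppelin-cluster"
  --release-label "$release_label"
  --applications Name=Spark Name=Zeppelin
  --configurations "file://$config_file"
  --instance-type m5.xlarge
  --instance-count 3
  --use-default-roles
  --steps "$step")

# Dry run: print the command. Run "${cmd[@]}" instead to actually create the cluster.
printf '%q ' "${cmd[@]}"
echo
```

Keeping the arguments in an array avoids quoting problems with the brackets in Args=[...] and makes it easy to review the command before launching anything.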