apache-spark hadoop-yarn emr apache-zeppelin

How to set up Zeppelin to work with remote EMR Yarn cluster


I have an Amazon EMR cluster (Hadoop 2.6, Spark 1.4.1) that uses YARN as its resource manager. I want to deploy Zeppelin on a separate machine so that I can turn off the EMR cluster when no jobs are running.

I tried following the instructions at https://zeppelin.incubator.apache.org/docs/install/yarn_install.html, without much success.

Can somebody demystify the steps needed for Zeppelin to connect to an existing YARN cluster from a different machine?


Solution

  • [1] Build and install Zeppelin with the profiles that match the cluster's Spark and Hadoop versions:

    git clone https://github.com/apache/incubator-zeppelin.git ~/zeppelin;
    cd ~/zeppelin;
    mvn clean package -Pspark-1.4 -Dhadoop.version=2.6.0 -Phadoop-2.6 -Pyarn -DskipTests
    

    [2] Update the EMR_MASTER EC2 security group to accept incoming connections on all ports, so that the cluster can communicate with Zeppelin (this should be narrowed to the specific ports required; I don't yet know which ones).

    [3] Copy the directory EMR_MASTER:/etc/hadoop/conf to MY_STANDALONE_SERVER:/home/zeppelin/hadoop-conf.

    [4] zeppelin/conf/zeppelin-env.sh should contain:

    export MASTER=yarn-client
    export HADOOP_CONF_DIR=/home/zeppelin/hadoop-conf
    

    Note: Spark parameters such as spark.executor.instances are taken from the Interpreter settings, so specify them there.
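
Steps [3] and [4] could be scripted roughly as follows. This is only a sketch under assumptions: the function names (fetch_hadoop_conf, write_zeppelin_env, check_hadoop_conf) are made up for illustration, the SSH target is whatever login works for your EMR master, and the check only looks for core-site.xml and yarn-site.xml, which a YARN client reads from HADOOP_CONF_DIR.

```shell
#!/usr/bin/env bash
# Sketch of steps [3] and [4]. All function names here are hypothetical helpers,
# not part of Zeppelin, Hadoop or EMR.

# Step [3]: copy the Hadoop client configuration from the EMR master.
# $1 = SSH target (assumed form: user@EMR_MASTER), $2 = local destination directory.
# Note: if $2 already exists, scp places the files in $2/conf instead.
fetch_hadoop_conf() {
    local remote="$1" dest="$2"
    scp -r "$remote:/etc/hadoop/conf" "$dest"
}

# Step [4]: append the two settings to zeppelin-env.sh unless they are already there.
# $1 = path to zeppelin-env.sh, $2 = local Hadoop configuration directory.
write_zeppelin_env() {
    local env_file="$1" conf_dir="$2"
    touch "$env_file"
    grep -q '^export MASTER=' "$env_file" \
        || echo 'export MASTER=yarn-client' >> "$env_file"
    grep -q '^export HADOOP_CONF_DIR=' "$env_file" \
        || echo "export HADOOP_CONF_DIR=$conf_dir" >> "$env_file"
}

# Sanity check: the copied directory should contain the files a YARN client reads.
# $1 = local Hadoop configuration directory. Returns non-zero if a file is missing.
check_hadoop_conf() {
    local conf_dir="$1" f
    for f in core-site.xml yarn-site.xml; do
        if [ ! -f "$conf_dir/$f" ]; then
            echo "missing $conf_dir/$f" >&2
            return 1
        fi
    done
}
```

With the paths from the answer, one would call fetch_hadoop_conf user@EMR_MASTER /home/zeppelin/hadoop-conf, then check_hadoop_conf /home/zeppelin/hadoop-conf, then write_zeppelin_env ~/zeppelin/conf/zeppelin-env.sh /home/zeppelin/hadoop-conf, and restart Zeppelin so it picks up the new environment.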