apache-spark, hadoop-yarn, distributed-computing, apache-zeppelin

How is Zeppelin integrated with Spark?


I am very new to distributed data processing and would like to understand how Zeppelin communicates with a Spark cluster. I am wondering how Zeppelin is able to retrieve data frames generated in previous paragraphs and use them in the current code. Also, what happens when multiple users use the same Zeppelin instance, i.e. different notebooks connected to the same Spark?

How does Spark know which job to run first, and does it keep all the data frames in memory?

I am using YARN.


Solution

  • This is a fairly broad question, so let me answer the parts one by one.

    A. Communicating with an external Spark cluster.

    As you know, Zeppelin ships with a built-in Spark, but it runs on the local machine, so it cannot handle large computations due to resource limitations.

    To use an external Spark, set SPARK_HOME in conf/zeppelin-env.sh.
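
    For example, the relevant lines in conf/zeppelin-env.sh might look like this (a minimal sketch; the paths are placeholders for your own installation):

        export SPARK_HOME=/opt/spark              # external Spark installation Zeppelin should use
        export HADOOP_CONF_DIR=/etc/hadoop/conf   # so Spark on YARN can find the ResourceManager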

    Sometimes you might want to use several different Spark clusters with one Zeppelin instance. In that case, you can create multiple Spark interpreters and set SPARK_HOME separately in each interpreter setting.

    B. YARN settings for Zeppelin.

    You can specify yarn-client mode in the Spark interpreter setting.

    For yarn-cluster mode, please use the Livy interpreter.
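
    As an illustration, the key property in the Spark interpreter setting (Interpreter menu in the Zeppelin UI) is master; the executor memory value below is just a placeholder:

        master                  yarn-client   # driver runs inside Zeppelin, executors on YARN
        spark.executor.memory   4g            # placeholder; tune for your cluster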

    C. Retrieving data created in previous paragraphs.

    • Variables: by default, every variable is accessible from every paragraph because they all share the same interpreter context. So if you create an RDD in one paragraph, you can access it from the other paragraphs (even earlier ones).
    • Tables: you can create a table from an RDD using registerTempTable, then simply query that table in the next paragraph (see the sketch below).
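
    A minimal sketch of what this looks like across two Zeppelin paragraphs (the table name, column names, and sample rows are made up for illustration; sc and sqlContext are provided by the Spark interpreter):

        // Paragraph 1: build a DataFrame and register it as a temp table
        import sqlContext.implicits._                       // usually already imported by Zeppelin
        val people = sc.parallelize(Seq(("alice", 29), ("bob", 35))).toDF("name", "age")
        people.registerTempTable("people")                  // makes it queryable from later paragraphs

        // Paragraph 2: query the table registered above (or use a %sql paragraph)
        sqlContext.sql("SELECT name FROM people WHERE age > 30").show()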

    These example notes can help.

    D. Multiple users on the same Spark cluster.

    By default, every user shares the same variables, Spark context, and resources. As you can imagine, that is not ideal. For this reason, Zeppelin supports interpreter binding modes (similar to session support), so that other users' actions cannot affect your notebook or your Spark interpreter.

    In short, in isolated mode each user gets a dedicated Spark interpreter process (JVM), while in scoped mode users share the Spark context but not their variables.

    E. Configuring multi-user support.

    These articles can help you.