I am very new to distributed data processing and would like to understand how Zeppelin communicates with a Spark cluster. I am also wondering how Zeppelin is able to retrieve data frames generated in previous paragraphs and use them in the current code, and what happens when multiple users use the same Zeppelin instance, i.e. different notebooks connected to the same Spark.
How does Spark know which job to run first, and does it keep all the data frames in memory?
I am using YARN.
It looks like a very broad question. Let me answer the points one by one.
A. Communication with an external Spark cluster.
As you know, Zeppelin provides a built-in Spark, but it runs on the local machine and therefore cannot handle large computations due to resource limitations.
To use an external Spark, you can set SPARK_HOME in conf/zeppelin-env.sh.
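For reference, a minimal conf/zeppelin-env.sh sketch might look like the following; the installation paths are assumptions for illustration, and HADOOP_CONF_DIR only matters when the cluster runs on YARN:

    # conf/zeppelin-env.sh -- sketch only, adjust the paths to your environment
    export SPARK_HOME=/opt/spark              # assumed location of the external Spark install
    export HADOOP_CONF_DIR=/etc/hadoop/conf   # assumed location; lets Spark find the YARN resource manager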
Sometimes you might want to use multiple different Spark clusters with one Zeppelin instance. In that case, you can create multiple Spark interpreters and set SPARK_HOME for each Spark interpreter setting.
B. YARN settings for Zeppelin
You can specify yarn-client mode in the Spark interpreter setting (i.e. set the master property to yarn-client).
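In the Spark interpreter properties this is just a key/value pair, roughly like the sketch below (the app name line is only an illustrative example):

    master          yarn-client
    spark.app.name  Zeppelin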
For yarn-cluster mode, please use the Livy interpreter.
C. Retrieve the data created in the previous paragraphs.
If you create a variable such as an RDD or a DataFrame in one paragraph, you can access it from the other paragraphs (even from previous paragraphs), because all paragraphs in a note run in the same interpreter process. Alternatively, create a DataFrame and call registerTempTable on it, then just query the table in the next paragraph. These example notes can help.
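As a rough sketch, here is how two paragraphs in one note could share data; the JSON path and table name are made up, and Zeppelin pre-defines sc and sqlContext for the Spark interpreter:

    // Paragraph 1: build a DataFrame and register it as a temp table.
    // "/tmp/events.json" is a hypothetical example file.
    val events = sqlContext.read.json("/tmp/events.json")
    events.registerTempTable("events")   // Spark 1.x API; on Spark 2.x use createOrReplaceTempView

    // Paragraph 2: both the variable and the temp table are still visible,
    // because paragraphs in a note share the same interpreter REPL and SparkContext.
    val recent = sqlContext.sql("SELECT * FROM events LIMIT 10")
    recent.show()

You could also query the same temp table from a %sql paragraph (e.g. SELECT * FROM events), since it uses the same sqlContext.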
D. Multiple users with the same Spark cluster
By default, every user shares the variables, the SparkContext, and the cluster resources. As you can guess, this is not a good idea. That is why Zeppelin supports interpreter binding modes (similar to session support), so that other users' actions cannot affect your notebook or your Spark interpreter.
In short, each user can have a dedicated Spark interpreter process (JVM) in isolated mode, or users can share the SparkContext while not sharing their variables in scoped mode.
E. Configuring multi-user support.
These articles can help you.