apache-spark, rpc

Where does the RpcEnv instance live: in the Driver, Master, or Worker?


Where does the RpcEnv instance live, how does each component get its corresponding RpcEnv instance, and how do the components connect to each other?


Solution

  • RpcEnv is an RPC Environment that is created separately for every component in Spark; the components use it to exchange messages with one another for remote communication.

    Spark creates the RPC environments for the driver and executors (by executing the SparkEnv.createDriverEnv and SparkEnv.createExecutorEnv methods, respectively).

    SparkEnv.createDriverEnv is used exclusively when SparkContext is created for the driver:

    _env = createSparkEnv(_conf, isLocal, listenerBus)
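
    For reference, createSparkEnv in SparkContext is a thin wrapper around SparkEnv.createDriverEnv. A simplified sketch (the exact parameter list varies across Spark versions):

    // Inside SparkContext (simplified); newer versions pass additional arguments.
    private[spark] def createSparkEnv(
        conf: SparkConf,
        isLocal: Boolean,
        listenerBus: LiveListenerBus): SparkEnv = {
      // The driver's RpcEnv (as part of the driver's SparkEnv) is created here.
      SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
    }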
    

    You can also create an RPC Environment yourself using the RpcEnv.create factory methods (as ExecutorBackends do, e.g. CoarseGrainedExecutorBackend, which calls SparkEnv.createExecutorEnv):

    val env = SparkEnv.createExecutorEnv(
        driverConf, executorId, hostname, cores, cfg.ioEncryptionKey, isLocal = false)
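
    For completeness, the RpcEnv.create factory itself looks roughly like this. Note that RpcEnv is private[spark], so the call below is a sketch of Spark-internal usage rather than public API, and the system name, host and port are illustrative:

    // Spark-internal sketch: create a standalone RPC environment.
    val securityMgr = new SecurityManager(conf)
    val rpcEnv = RpcEnv.create(
      "sketchRpcEnv",   // illustrative system name
      "localhost",      // host to bind to
      0,                // port (0 lets Spark pick a free port)
      conf,
      securityMgr,
      clientMode = false)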
    

    Separate RpcEnvs are also created for standalone Master and workers.
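
    As a sketch of the standalone case, Master.startRpcEnvAndEndpoint (simplified here, and subject to change between Spark versions) creates the Master's own RpcEnv and registers the Master endpoint in it; the Worker does the equivalent on its side:

    // Simplified sketch of Master.startRpcEnvAndEndpoint in Spark standalone.
    val securityMgr = new SecurityManager(conf)
    val rpcEnv = RpcEnv.create("sparkMaster", host, port, conf, securityMgr)
    // Register the Master RpcEndpoint under the name "Master" so others can look it up.
    val masterEndpoint = rpcEnv.setupEndpoint(
      "Master", new Master(rpcEnv, rpcEnv.address, webUiPort, securityMgr, conf))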


    How do the components connect to each other?

    Not much magic here :) The driver of a Spark application and the standalone Master of a Spark Standalone cluster are created first, and they have no dependency on any other components.

    When the driver of a Spark application starts, it requests resources from the cluster manager (in the form of resource containers), together with the command to launch executors (which differs per cluster manager). The launch command includes the connection details (i.e. host and port) of the driver's RpcEndpoint.
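
    A rough sketch of the executor side (the names are illustrative; the real code lives in CoarseGrainedExecutorBackend): the launch command carries a driver URL of the form spark://CoarseGrainedScheduler@<driver-host>:<port>, which the executor backend resolves to an RpcEndpointRef:

    // Sketch: the executor backend connects back to the driver using the URL
    // it received in its launch command (RpcEnv is Spark-internal API).
    val driverUrl = args.driverUrl  // e.g. "spark://CoarseGrainedScheduler@driver-host:port"
    val env = RpcEnv.create(
      "executorSketch", hostname, 0, conf, new SecurityManager(conf), clientMode = true)
    // Resolve a reference to the driver's endpoint; messages (e.g. RegisterExecutor)
    // can then be sent through this RpcEndpointRef.
    val driverRef = env.setupEndpointRefByURI(driverUrl)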

    See how it works with Hadoop YARN in Client.

    It is a similar process for standalone Workers, with the difference that the administrator has to specify the master's URL on the command line:

    $ ./sbin/start-slave.sh
    Usage: ./sbin/start-slave.sh [options] <master>
    
    Master must be a URL of the form spark://hostname:port
    
    Options:
      -c CORES, --cores CORES  Number of cores to use
      -m MEM, --memory MEM     Amount of memory to use (e.g. 1000M, 2G)
      -d DIR, --work-dir DIR   Directory to run apps in (default: SPARK_HOME/work)
      -i HOST, --ip IP         Hostname to listen on (deprecated, please use --host or -h)
      -h HOST, --host HOST     Hostname to listen on
      -p PORT, --port PORT     Port to listen on (default: random)
      --webui-port PORT        Port for web UI (default: 8081)
      --properties-file FILE   Path to a custom Spark properties file.
                               Default is conf/spark-defaults.conf.
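
    On the Worker side, the <master> URL from the command line is parsed into an RpcAddress and used to look up the Master's endpoint. A simplified sketch (the real logic is in the Worker's master-registration code):

    // Sketch: turn the spark://hostname:port URL into a reference to the Master endpoint.
    // rpcEnv here is the Worker's own RpcEnv.
    val masterAddress = RpcAddress.fromSparkURL("spark://hostname:7077")
    val masterRef = rpcEnv.setupEndpointRef(masterAddress, "Master")
    // The Worker then sends a RegisterWorker message via masterRef.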