
Where do Spark components live in SnappyData Unified Cluster mode?


I'm trying to understand where all the "Spark" pieces fit into SnappyData's "Unified Cluster Mode" deployment topology.

Having read the page below, a few things in the documentation are still unclear to me:

http://snappydatainc.github.io/snappydata/deployment/#unified-cluster-mode-aka-embedded-store-mode

  1. Who is the Master - Lead or Locator?
  2. Slaves/Workers execute on... - Lead or Server?
  3. Executors execute on... - Server (This seemed straightforward in the docs)
  4. Apps execute on... - Lead or Server?
  5. Jobs execute on... - Lead or Server?
  6. Streams execute on... - Lead or Server?

Solution

  • SnappyData is a peer-to-peer cluster and does its own cluster management, so it does not need a cluster manager such as Spark's standalone cluster manager or YARN to start and stop Spark drivers and executors. When the SnappyData lead node starts, it starts a Spark driver inside it, and Spark executors are started inside all the SnappyData servers. With that in mind, here are the answers to your questions:

    Who is the Master - Lead or Locator?

    SnappyData does not have a Spark Master; the cluster manages its own membership, and the Spark driver runs inside the lead node.

    Slaves/Workers execute on... - Lead or Server?

    SnappyData does not have slaves/workers; the Spark executors run inside the SnappyData servers.

    Executors execute on... - Server (This seemed straightforward in the docs)

    Correct.

    Apps execute on... - Lead or Server? Jobs execute on... - Lead or Server?

    An application in Spark is a self-contained set of computations. For every Spark application, a driver is launched that starts a Spark context; the context coordinates the application, and both the context and the driver go away when the application ends. In that sense, SnappyData can be seen as one long-running Spark application, because it starts a Spark context inside the lead node and keeps it running. SnappyData jobs are submitted to the lead node and executed by that already-running Spark context: the lead node (Spark driver) schedules the jobs on the servers (Spark executors), which actually execute the tasks. A sketch of such a job is shown below.
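
    As an illustration, here is a minimal sketch of a SnappyData job. It assumes the job API described in the docs of that era (a SnappySQLJob with runSnappyJob/isValidJob, with the classes under org.apache.spark.sql); the exact package names and signatures, and the table name my_table, are assumptions you should check against your SnappyData release:

        import com.typesafe.config.Config
        import org.apache.spark.sql.{SnappyJobValid, SnappyJobValidation, SnappySQLJob, SnappySession}

        // Packaged into a jar and submitted to the lead node. runSnappyJob is then
        // executed by the Spark context that is already running inside the lead,
        // while the actual scan runs on the executors embedded in the servers.
        class RowCountJob extends SnappySQLJob {

          override def runSnappyJob(snSession: SnappySession, jobConfig: Config): Any = {
            // my_table is a placeholder table name
            snSession.sql("SELECT COUNT(*) FROM my_table").collect()(0).getLong(0)
          }

          override def isValidJob(snSession: SnappySession, config: Config): SnappyJobValidation =
            SnappyJobValid()
        }

    Such a jar is typically submitted to the lead node with the snappy-job.sh script shipped with SnappyData, since the lead hosts the job server that accepts and runs these jobs.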

    Streams execute on... - Lead or Server?

    Spark execution is unchanged. When a streaming job is submitted to the lead node, it creates a receiver on one of the available servers and then schedules the micro-batch jobs for the received data; see the sketch below.
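
    Because Spark execution is unchanged, a plain Spark Streaming program is enough to illustrate the placement (the docs of that era also describe a streaming variant of the job API that wraps the same machinery). The host and port below are placeholders:

        import org.apache.spark.SparkContext
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        // The receiver created by socketTextStream is placed on one of the
        // executors, i.e. on a SnappyData server; the driver inside the lead
        // only schedules the micro-batch jobs for the received data.
        object StreamPlacementSketch {
          def main(args: Array[String]): Unit = {
            val sc = SparkContext.getOrCreate()             // context hosted by the lead
            val ssc = new StreamingContext(sc, Seconds(10))

            val lines = ssc.socketTextStream("some-host", 9999)  // placeholder source
            lines.count().print()   // counts computed on the servers; printed by the driver

            ssc.start()
            ssc.awaitTermination()
          }
        }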