I have a p2p mesh network of nodes. It has its own load balancing and, given a task T, it can reliably execute it (if one node fails, another will continue). My mesh network has Java and Python APIs. I wonder what steps are needed to make Spark call my API to launch tasks?
Oh boy, that's a really broad question, but I agree with Daniel. If you really want to do this, you could first start with:
Scheduler Backends, which states things like:
Being a scheduler backend in Spark assumes an Apache Mesos-like model in which "an application" gets resource offers as machines become available and can launch tasks on them. Once a scheduler backend obtains the resource allocation, it can start executors.
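To make that concrete, here is a minimal sketch of what a custom backend for your mesh could look like. Keep in mind that SchedulerBackend is an internal (private[spark]) trait, so the class has to live under the org.apache.spark package, the set of required methods differs between Spark versions (this is written against the 2.x trait; newer releases require more), and MeshClient is a purely hypothetical stand-in for your own Java API:

```scala
// Sketch only: SchedulerBackend is private[spark], so this must live under
// the org.apache.spark package; method signatures vary across Spark versions.
package org.apache.spark.scheduler.mesh

import org.apache.spark.scheduler.SchedulerBackend

// Hypothetical wrapper around your mesh network's Java API.
class MeshClient {
  def connect(): Unit = ()          // join the mesh for this application
  def disconnect(): Unit = ()       // release whatever the mesh allocated
  def availableSlots(): Int = 8     // how many tasks the mesh can run in parallel
}

class MeshSchedulerBackend(client: MeshClient) extends SchedulerBackend {

  // Called once by the TaskScheduler: acquire resources / start executors on the mesh.
  override def start(): Unit = client.connect()

  // Called on shutdown: free the resources held for this application.
  override def stop(): Unit = client.disconnect()

  // Called whenever the TaskScheduler has tasks to run; a real implementation would
  // ask the mesh for resource offers and place tasks on the offered nodes.
  override def reviveOffers(): Unit = ()

  // Default number of partitions Spark uses when nothing else is specified.
  override def defaultParallelism(): Int = client.availableSlots()
}
```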
TaskScheduler, since you need to understand how tasks are meant to be scheduled in order to build a scheduler. It mentions things like this:
A TaskScheduler gets sets of tasks (as TaskSets) submitted to it from the DAGScheduler for each stage, and is responsible for sending the tasks to the cluster, running them, retrying if there are failures, and mitigating stragglers.
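In practice, custom cluster managers rarely implement TaskScheduler from scratch; they reuse Spark's concrete TaskSchedulerImpl (which already handles the TaskSet bookkeeping, retries and straggler mitigation quoted above) and pair it with their own backend. The same private[spark] caveat applies, and the constructor and TaskSet internals are version-dependent, so take this only as a sketch:

```scala
package org.apache.spark.scheduler.mesh

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{TaskSchedulerImpl, TaskSet}

// Reuse TaskSchedulerImpl so retries, speculation and locality handling come for free;
// add mesh-specific behaviour only where needed.
class MeshTaskScheduler(sc: SparkContext) extends TaskSchedulerImpl(sc) {

  // The DAGScheduler calls submitTasks once per stage, handing over that stage's TaskSet.
  override def submitTasks(taskSet: TaskSet): Unit = {
    logInfo(s"Received TaskSet ${taskSet.id} with ${taskSet.tasks.length} tasks for the mesh")
    super.submitTasks(taskSet)
  }
}
```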
An important concept here is the Directed Acyclic Graph (DAG); you can take a look at the DAGScheduler's GitHub page.
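To build an intuition for the DAG, you can inspect the lineage of any RDD with the public toDebugString API; each indentation level in its output marks a stage boundary that the DAGScheduler turns into a separate TaskSet (the input path below is just an example):

```scala
import org.apache.spark.sql.SparkSession

object DagDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dag-demo").getOrCreate()
    val sc = spark.sparkContext

    // Two narrow transformations (flatMap, map) and one wide transformation (reduceByKey):
    // the shuffle introduced by reduceByKey splits the job into two stages.
    val counts = sc.textFile("README.md")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Prints the RDD lineage, i.e. the DAG Spark derived from the transformations above.
    println(counts.toDebugString)

    spark.stop()
  }
}
```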
You can also read What is the difference between FAILED AND ERROR in spark application states to get an intuition.
Spark Listeners — Intercepting Events from Spark can also come in handy:
Spark Listeners intercept events from the Spark scheduler that are emitted over the course of execution of Spark applications.
You could first take the Exercise: Developing Custom SparkListener to monitor DAGScheduler in Scala to check your understanding.
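In the spirit of that exercise, here is a minimal custom listener, using only the public SparkListener API, that logs what the scheduler does with jobs and stages:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart, SparkListenerStageCompleted}

// A custom SparkListener that intercepts scheduler events as they are emitted.
class SchedulerMonitorListener extends SparkListener {

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    println(s"Stage ${stageCompleted.stageInfo.stageId} completed " +
      s"with ${stageCompleted.stageInfo.numTasks} tasks")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}
```

You can register it programmatically with sc.addSparkListener(new SchedulerMonitorListener()) or via the spark.extraListeners property (the latter requires the class on the classpath with a no-arg constructor).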
In general, Mastering Apache Spark 2.0 seems to have plenty of resources, but I will not list more here.
Then you have to meet the Final Boss in this game, Spark's Scheduler GitHub page, and get a feel for how it all fits together.
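As for wiring everything together: since Spark 2.0 there is an ExternalClusterManager trait that Spark discovers through Java's ServiceLoader and asks to build the TaskScheduler / SchedulerBackend pair for an unrecognised master URL. It is still private[spark] and can change between releases, so the following, which reuses the hypothetical mesh classes sketched above, is only an illustration:

```scala
package org.apache.spark.scheduler.mesh

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend, TaskScheduler, TaskSchedulerImpl}

class MeshClusterManager extends ExternalClusterManager {

  // Claim master URLs such as "mesh://host:port" for this cluster manager.
  override def canCreate(masterURL: String): Boolean = masterURL.startsWith("mesh://")

  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new MeshTaskScheduler(sc)

  override def createSchedulerBackend(sc: SparkContext,
                                      masterURL: String,
                                      scheduler: TaskScheduler): SchedulerBackend =
    new MeshSchedulerBackend(new MeshClient)

  // Wire the two halves together once both have been created.
  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}
```

The implementation is registered by listing its fully qualified class name in META-INF/services/org.apache.spark.scheduler.ExternalClusterManager on the classpath, after which submitting with --master mesh://... should pick it up; the backend is then the place where you call into your mesh's Java API to actually launch the tasks. Hopefully, all this will be enough to get you started! :)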