Tags: architecture, rpc, apache-spark

How to run an RPC server on Apache Spark?


I wonder if the following setup is possible with Apache Spark:

                     ____________________
                    |  resident Backend  |   RPC
Distributed DB <--> |      server        | <---> Frontend
                    |____________________|
                            Spark

What I'm hoping to do is utilize Spark's MLlib & Spark Streaming in the backend, as well as take advantage of Spark's speed for my backend computations (statistics, machine learning).

Spark's architecture seems to require submitting computations one at a time, as JARs of compiled code. However, because the entire setup is for a multi-user web app & external API, it seems much more straightforward to have a long-running Backend server that communicates with the Frontend via RPCs.

Is this at all possible, without too much hacking? It would seem that the nature of Spark Streaming would necessitate having a resident server. Is JavaStreamingContext#awaitTermination() the only way to try to implement such an application?
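
If JavaStreamingContext#awaitTermination() is indeed the mechanism, the "resident" part seems to be little more than a driver program whose main thread starts a streaming context and then blocks. A rough Java sketch of what I have in mind (the socket source, host/port and batch interval are only placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class ResidentBackend {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("ResidentBackend");
            // 1-second micro-batches; the interval here is an arbitrary placeholder.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

            // Placeholder source; in practice this would be whatever stream the Backend consumes.
            JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
            lines.print();

            jssc.start();
            // Blocks the main thread, keeping the application resident on the cluster.
            jssc.awaitTermination();
        }
    }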

One issue I foresee, even if this is possible, is how the Frontend would address such a Backend in order to establish a connection.

Many thanks!


Solution

  • Ha, I realized that Spark JobServer, which I initially labeled as "half-way there", is in fact a solution to the problem. (If there are other, or simpler, solutions, please post them too.)

    Spark JobServer wraps Spark and communicates with the outside world over HTTP in a RESTful style. There is a command to upload a JAR of compiled computation code, and a separate command to execute any of the uploaded JARs, supplying input along with the request.

    So, the solution to my problem is to start Spark JobServer, upload the JARs for the computations I want my system to perform, and issue HTTP RPCs from the Frontend asking JobServer to launch the appropriate JARs on Spark (a sketch of such calls follows below).

    Details in JobServer README.
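
    For illustration, here is a rough sketch of what those HTTP RPCs from the Frontend might look like in Java. The /jars/<app> upload route and the /jobs?appName=...&classPath=... run route follow the style documented in the JobServer README; the port, app name, class path, JAR path and "input.data" key are placeholders that will differ per deployment:

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.nio.file.Path;

        public class JobServerClient {
            // JobServer's HTTP port is assumed to be the default 8090; adjust to the actual deployment.
            private static final String JOBSERVER = "http://localhost:8090";

            public static void main(String[] args) throws Exception {
                HttpClient http = HttpClient.newHttpClient();

                // 1. Upload the JAR with the compiled computation once, under an app name.
                HttpRequest upload = HttpRequest.newBuilder()
                        .uri(URI.create(JOBSERVER + "/jars/mystats"))
                        .POST(HttpRequest.BodyPublishers.ofFile(Path.of("target/stats-job.jar")))
                        .build();
                System.out.println(http.send(upload, HttpResponse.BodyHandlers.ofString()).body());

                // 2. Ask JobServer to run a job from that JAR, passing input as job config.
                HttpRequest run = HttpRequest.newBuilder()
                        .uri(URI.create(JOBSERVER + "/jobs?appName=mystats&classPath=com.example.StatsJob"))
                        .POST(HttpRequest.BodyPublishers.ofString("input.data = 1 2 3"))
                        .build();
                System.out.println(http.send(run, HttpResponse.BodyHandlers.ofString()).body());
            }
        }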