I wonder if the following setup is possible with Apache Spark:
                        ___________________
                       |                         RPC
    Distributed DB <-> |  resident Backend   <-------> Frontend
                       |  server
                       |___________________
                              Spark
What I'm hoping to do is use Spark's MLlib and Spark Streaming in the Backend, and take advantage of Spark's speed for the Backend's computations (statistics, machine learning).
Spark's architecture seems to require submitting computations one at a time, as Jars of compiled code. However, because the entire setup is for a multi-user web app & external API, it seems much more straightforward to have a long-running Backend server, communicating via RPCs with the Frontend.
Is this at all possible without too much hacking? It would seem that the nature of Spark Streaming necessitates a resident server anyway. Is JavaStreamingContext#awaitTermination() the only way to implement such an application?
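For reference, here is a minimal sketch of what such a resident Streaming driver looks like in Java; the socket source, port, and batch interval are just placeholders, but it shows how awaitTermination() keeps the driver process alive:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class ResidentStreamingDriver {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("ResidentBackend");
            // 1-second micro-batches; the interval is only an example
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

            // Placeholder source: any DStream would do here
            jssc.socketTextStream("localhost", 9999).print();

            jssc.start();
            // Blocks the driver for the lifetime of the streaming job,
            // effectively turning it into a long-running process
            jssc.awaitTermination();
        }
    }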
One possible issue I see, even if this is feasible, is how the Frontend would address such a Backend in order to establish a connection to it.
Many thanks!
Ha, I realized that Spark JobServer, which I had initially labeled as only "half-way there", is in fact a solution to the problem. (If there are other or simpler solutions, please post them too.)
Spark JobServer wraps Spark and communicates with the outside world over HTTP in a RESTful style. There is a command to upload a Jar containing compiled computation code, and a separate command to execute any of the uploaded Jars, passing input along with the request.
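As a rough sketch, the Jar upload can be driven from any HTTP client; the endpoint path (/jars/&lt;appName&gt;) and default port 8090 are as I recall them from the JobServer README, and the Jar path and app name below are made up, so check them against the version you deploy:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class JarUploader {
        public static void main(String[] args) throws Exception {
            // POST the compiled Jar to JobServer under an app name of our choosing.
            // Path and port follow the JobServer README; adjust for your deployment.
            byte[] jar = Files.readAllBytes(Paths.get("target/my-computations.jar"));
            URL url = new URL("http://localhost:8090/jars/my-computations");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            // The README uses curl --data-binary; the raw Jar bytes go in the body
            conn.setRequestProperty("Content-Type", "application/octet-stream");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(jar);
            }
            System.out.println("Upload returned HTTP " + conn.getResponseCode());
        }
    }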
So, the solution to my problem is to start Spark JobServer, upload the Jars for the computations I want my system to perform, and issue HTTP RPCs from the Frontend asking JobServer to launch the appropriate Jars on Spark.
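The "launch a Jar" RPC is then just another HTTP call. The sketch below assumes the /jobs endpoint with appName and classPath query parameters as described in the README; com.example.StatsJob and the input.table setting are hypothetical stand-ins for whatever computation class and job input you actually upload:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class JobTrigger {
        public static void main(String[] args) throws Exception {
            // Ask JobServer to run a job class from the previously uploaded Jar
            URL url = new URL("http://localhost:8090/jobs"
                    + "?appName=my-computations&classPath=com.example.StatsJob");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            // Job input goes in the request body (Typesafe Config syntax in the README examples)
            String input = "input.table = \"user_events\"";
            try (OutputStream out = conn.getOutputStream()) {
                out.write(input.getBytes(StandardCharsets.UTF_8));
            }
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON response with the job id / status
                }
            }
        }
    }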
Details are in the JobServer README.