I'm evaluating Apache Spark to see if it's a good platform for the following requirements:
Having read a bit about Spark, I see the following advantages:
These are the main questions I still have:
I also wonder whether I'm simply trying to use Spark for a purpose it wasn't designed for, without using its main advantages: MapReduce and in-memory RDDs. If so, I'd also welcome a suggestion for an alternative. Many thanks!
We do use Spark in an interactive setting, as the backend of a web interface. Sub-second latencies are possible, but not easy. Some tips:
- Create the `SparkContext` on start up. It takes a few seconds to get connected and get the executors started on the workers.
- Instead of each user having their own `SparkContext` and own set of executors, have just one that everyone can share. In our case multiple users can use the web interface concurrently, but there's only one web server.
- Use `RDD.sample` liberally. An unbiased sample is often good enough for interactive exploration. (A rough sketch of all three tips follows the list.)
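To make those tips concrete, here is a rough sketch of the shape of such a setup (not our actual code); the app name, master URL, data path and sample fraction are all made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedSpark {
  // One SparkContext for the whole web server, created lazily once at
  // start up: the few seconds spent connecting and launching executors
  // are paid before the first user request arrives.
  lazy val sc: SparkContext = {
    val conf = new SparkConf()
      .setAppName("interactive-backend")      // hypothetical app name
      .setMaster("spark://spark-master:7077") // hypothetical cluster URL
    new SparkContext(conf)
  }
}

object ExploreHandler {
  // A request handler uses the shared context and samples before doing
  // anything expensive. Path and fraction are made up for illustration.
  def preview(): Unit = {
    val events = SharedSpark.sc.textFile("hdfs:///data/events")
    // Unbiased 1% sample without replacement -- usually good enough for
    // interactive exploration and far faster than the full data set.
    val sample = events.sample(withReplacement = false, fraction = 0.01)
    sample.take(10).foreach(println)
  }
}
```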
Load balancing of operations is a good question. We will have to tackle this as well, but have not done it yet. In the default setup everything is processed in a first-in-first-out manner: each operation gets the full resources of the cluster, and the next operation has to wait. This is fine if each operation is fast, but what if one isn't?
The alternative fair scheduler likely solves this issue, but I have not tried it yet.
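For reference, enabling it is just a configuration switch. The sketch below follows the Spark job scheduling documentation rather than anything I have run, and the pool name is made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulerSketch {
  def main(args: Array[String]): Unit = {
    // Switch the in-application job scheduler from the default FIFO to
    // FAIR, so concurrent jobs share executors instead of queueing
    // behind one slow job.
    val conf = new SparkConf()
      .setAppName("interactive-backend")
      .setMaster("local[4]") // or your cluster URL
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread can also be routed to a named
    // pool; "interactive" is a made-up name (pools can be tuned in an
    // allocation XML file, or are created on demand with defaults).
    sc.setLocalProperty("spark.scheduler.pool", "interactive")
  }
}
```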
Spark can also offload scheduling to YARN or Mesos, but I have no experience with either, and I doubt they are compatible with your latency requirements.