
Is Apache Spark good for lots of small, fast computations and a few big, non-interactive ones?


I'm evaluating Apache Spark to see if it's a good platform for the following requirements:

  • Cloud computing environment.
  • Commodity hardware.
  • Distributed DB (e.g. HBase) with possibly a few petabytes of data.
  • Lots of simultaneous small computations that need to complete fast (within seconds). Small means 1-100 MB of data.
  • A few large computations that don't need to complete fast (hours is fine). Large means 10-1000 GB of data.
  • Very rarely, very large computations that don't need to complete fast (days is fine). Very large means 10-100 TB of data.
  • All computations are mutually independent.
  • A real-time data stream feeds some of the computations.
  • Machine learning is involved.

Having read a bit about Spark, I see the following advantages:

  • Runs well on commodity hardware and with HBase/Cassandra.
  • MLlib for machine learning.
  • Spark Streaming for real-time data.
  • While MapReduce doesn't seem strictly necessary, it could speed things up and would let us adapt if the requirements became tighter in the future.

These are the main questions I still have:

  • Can it do small computations very fast?
  • Will it load-balance a large number of simultaneous small computations?

I also wonder whether I'm trying to use Spark for a purpose it wasn't designed for, without using its main advantages: MapReduce and in-memory RDDs. If so, I'd also welcome a suggestion for an alternative. Many thanks!


Solution

  • Small computations fast

    We do use Spark in an interactive setting, as the backend of a web interface. Sub-second latencies are possible, but not easy. Some tips:

    • Create the SparkContext at startup. It takes a few seconds to connect and get the executors started on the workers.
    • You mention many simultaneous computations. Instead of each user having their own SparkContext and own set of executors, have just one that everyone can share. In our case multiple users can use the web interface concurrently, but there is only one web server.
    • Operate on memory-cached RDDs. Serialization is probably too slow, so use the default deserialized caching, not Tachyon. If you cannot avoid serialization, use Kryo; it is much faster than stock Java serialization.
    • Use RDD.sample liberally. An unbiased sample is often good enough for interactive exploration. (A sketch combining these tips follows this list.)
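
    A minimal sketch of these tips combined (Scala; the object names, app name, and HDFS path are illustrative assumptions, not from any particular codebase):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Hypothetical singleton: one SparkContext, created when the web server
    // starts, shared by all users/requests so per-request latency does not
    // include the several seconds of executor startup.
    object SharedSpark {
      private val conf = new SparkConf()
        .setAppName("interactive-backend")
        // Kryo is much faster than stock Java serialization wherever
        // serialization cannot be avoided (e.g. shuffles).
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      lazy val sc = new SparkContext(conf)
    }

    object InteractiveQuery {
      def main(args: Array[String]): Unit = {
        val sc = SharedSpark.sc

        // Cache the working set deserialized in memory (MEMORY_ONLY is the
        // default level), so repeated queries skip disk and deserialization.
        val data = sc.textFile("hdfs:///path/to/data") // illustrative path
          .persist(StorageLevel.MEMORY_ONLY)

        // An unbiased 1% sample (without replacement) is often good enough
        // for interactive exploration and far cheaper than a full pass.
        val approxCount =
          data.sample(withReplacement = false, fraction = 0.01).count() * 100
        println(s"Approximate record count: $approxCount")
      }
    }
    ```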

  • Load balancing

    Load balancing of operations is a good question. We will have to tackle it as well, but have not done so yet. In the default setup everything is processed in first-in-first-out order: each operation gets the full resources of the cluster, and the next operation has to wait. This is fine if each operation is fast, but what if one isn't?

    The alternative fair scheduler likely solves this issue, but I have not tried it yet. A configuration sketch follows.
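
    A sketch of what that configuration would look like, per the Spark documentation (untested by me, as noted above; the pool name and allocation-file path are illustrative):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    // Switch from the default FIFO scheduler to the fair scheduler, so a
    // long job no longer blocks the short ones queued behind it.
    val conf = new SparkConf()
      .setAppName("fair-scheduling")
      .set("spark.scheduler.mode", "FAIR")
      // Optional: define pools (weights, minimum shares) in an XML file.
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go into the named pool; different
    // request-handling threads can target different pools.
    sc.setLocalProperty("spark.scheduler.pool", "interactive")
    ```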

    Spark can also offload scheduling to YARN or Mesos, but I have no experience with this, and I doubt they are compatible with your latency requirements.