hadoop, cassandra, apache-spark, hdfs, distributed-computing

Data motion in Cassandra/HDFS and Spark


When designing a distributed storage and analytics architecture, is it a common usage pattern to run the analytics engine on the same machines as the data nodes? Specifically, would it make sense to run Spark or Storm directly on Cassandra or HDFS nodes?

I know that MapReduce on HDFS follows this pattern, since, according to Hortonworks, YARN minimizes data motion. I have no idea whether the same holds for these other systems, though. I would imagine it does, since they seem so pluggable with each other, but I can't find any information about this online.

I'm sort of a newbie on this topic, so any resources or answers would be greatly appreciated.

Thanks


Solution

  • Yes, it makes sense to run Spark on Cassandra nodes to minimize data movement between machines.

    When you create an RDD from a Cassandra table, the RDD partitions are built from the token ranges that are local to each machine, so Spark tasks can be scheduled on the nodes that already own the data (see the sketch after this answer).

    Here's a link to a talk on this subject for the Spark Cassandra connector:

    Cassandra and Spark: Optimizing for Data Locality

    As it says in the summary: "There are only three things that are important in doing analytics on a distributed database: Locality, locality and locality."
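For concreteness, here is a minimal Scala sketch using the Spark Cassandra connector. The keyspace and table names ("ks", "users") and the contact point are placeholders for illustration; the point is that cassandraTable produces an RDD whose partitions correspond to groups of Cassandra token ranges, each carrying preferred locations on the nodes that own that data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CassandraLocalitySketch {
  def main(args: Array[String]): Unit = {
    // Contact point is an assumption; point it at one of your Cassandra nodes.
    val conf = new SparkConf()
      .setAppName("cassandra-locality-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext(conf)

    // cassandraTable builds an RDD whose partitions map to Cassandra token ranges;
    // each partition has preferred locations pointing at the replica nodes,
    // so tasks can run where the data already lives.
    val rdd = sc.cassandraTable("ks", "users") // hypothetical keyspace/table

    println(s"Partitions (token-range groups): ${rdd.partitions.length}")
    println(s"Rows: ${rdd.count()}")

    sc.stop()
  }
}
```

If the Spark executors are co-located with the Cassandra nodes, the scheduler can honor those preferred locations and most reads stay local, which is exactly the locality argument the talk above makes.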