hadoop, cassandra, apache-spark, hdfs, distributed-computing

Data motion in Cassandra/HDFS and Spark


When designing a distributed storage and analytics architecture, is it a common usage pattern to run the analytics engine on the same machines as the data nodes? Specifically, would it make sense to run Spark or Storm directly on Cassandra or HDFS nodes?

I know that MapReduce on HDFS follows this pattern, since, according to Hortonworks, YARN minimizes data motion. I have no idea whether the same holds for these other systems, though. I would imagine it does, since they seem so pluggable with each other, but I can't find any information about this online.

I'm sort of a newbie on this topic, so any resources or answers would be greatly appreciated.

Thanks


Solution

  • Yes, it makes sense to run Spark on Cassandra nodes to minimize data movement between machines.

    When you create an RDD from a Cassandra table, the RDD partitions are built from the token ranges that are local to each machine, so Spark tasks can be scheduled on the nodes that already own the data (see the sketch after this answer).

    Here's a link to a talk on this subject for the Spark Cassandra connector:

    Cassandra and Spark: Optimizing for Data Locality

    As it says in the summary: "There are only three things that are important in doing analytics on a distributed database: Locality, locality and locality."
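For concreteness, here is a minimal Scala sketch using the Spark Cassandra connector. The keyspace and table names ("ks", "users") and the contact point are placeholders for illustration; the point is that cassandraTable produces an RDD whose partitions correspond to groups of Cassandra token ranges, each carrying preferred locations on the nodes that own that data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CassandraLocalitySketch {
  def main(args: Array[String]): Unit = {
    // Contact point is an assumption; point it at one of your Cassandra nodes.
    val conf = new SparkConf()
      .setAppName("cassandra-locality-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext(conf)

    // cassandraTable builds an RDD whose partitions map to Cassandra token ranges;
    // each partition has preferred locations pointing at the replica nodes,
    // so tasks can run where the data already lives.
    val rdd = sc.cassandraTable("ks", "users") // hypothetical keyspace/table

    println(s"Partitions (token-range groups): ${rdd.partitions.length}")
    println(s"Rows: ${rdd.count()}")

    sc.stop()
  }
}
```

If the Spark executors are co-located with the Cassandra nodes, the scheduler can honor those preferred locations and most reads stay local, which is exactly the locality argument the talk above makes.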