Search code examples
apache-sparkhdfsdistributed-computingbigdata

How spark select where to run w.r.t hdfs


When I started to use big data technologies, I learn that the fundamental rule is "move the code, not the data". But I realise I don't know how that works: how does spark know where to move the code?

I'm speaking here about the very first steps, eg: read from a distributed file and a couple of map ops.

  1. In case of a hdfs file, how does spark knows where the actual data parts are? What is the tool/protocol at work?
  2. Is it different depending on the resource manager (stand-alone-spark/yarn/mesos)?
  3. What about on-top-of-hdfs storage app, such as hbase/hive?
  4. what about other distributed storage if they are running on the same machines (such as kafka)?
  5. Apart from spark, is it the same for similar distributed engine, such as storm/flink?

edit

For cassandra + spark, it seems that the (specialized) connector manages this data locality: https://stackoverflow.com/a/31300118/1206998


Solution

  • 1) Spark asks Hadoop for how input files is distributed into splits (another good explanation on splits) and turns splits into partitions. Check code of Spark's NewHadoopRDD:

    override def getPartitions: Array[Partition] = {
      val inputFormat = inputFormatClass.newInstance
      inputFormat match {
        case configurable: Configurable =>
          configurable.setConf(_conf)
            case _ =>
          }
        val jobContext = newJobContext(_conf, jobId)
        val rawSplits = inputFormat.getSplits(jobContext).toArray
        val result = new Array[Partition](rawSplits.size)
        for (i <- 0 until rawSplits.size) {
          result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
        }
      result
    }
    

    2) It's not. It depends on Hadoop InputFormat of the file.

    3) The same.

    4) Mechanism is similar, for example KafkaRDD implementation maps Kafka partitions into Spark partitions one-to-one.

    5) I believe they use the same mechanism.