When I started using big data technologies, I learned that the fundamental rule is "move the code, not the data". But I realised I don't know how that actually works: how does Spark know where to move the code?
I'm speaking here about the very first steps, e.g. reading from a distributed file and applying a couple of map operations.
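For concreteness, here is the kind of minimal job I have in mind (the HDFS path and the transformations are just placeholders):

import org.apache.spark.sql.SparkSession

object LocalityQuestion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("locality-question").getOrCreate()
    val sc = spark.sparkContext

    // Read a file that lives on a distributed file system (HDFS here),
    // then apply a couple of simple map operations.
    val lines = sc.textFile("hdfs:///data/events.log")  // hypothetical path
    val lengths = lines.map(_.trim).map(_.length)

    println(lengths.take(5).mkString(", "))
    spark.stop()
  }
}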
For Cassandra + Spark, it seems that the (specialized) connector manages this data locality: https://stackoverflow.com/a/31300118/1206998
1) Spark asks Hadoop how the input file is distributed into splits and turns those splits into partitions. Check the code of Spark's NewHadoopRDD:
override def getPartitions: Array[Partition] = {
  val inputFormat = inputFormatClass.newInstance
  inputFormat match {
    case configurable: Configurable =>
      configurable.setConf(_conf)
    case _ =>
  }
  val jobContext = newJobContext(_conf, jobId)
  val rawSplits = inputFormat.getSplits(jobContext).toArray
  val result = new Array[Partition](rawSplits.size)
  for (i <- 0 until rawSplits.size) {
    result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
  }
  result
}
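The piece that actually answers "where to move the code" is the companion method getPreferredLocations: for each partition, NewHadoopRDD reports the hosts returned by the underlying InputSplit.getLocations, and the scheduler then tries to launch the task on one of those hosts. A quick way to see that same hint surface from user code is makeRDD, which lets you attach preferred locations yourself (the host names below are made up):

import org.apache.spark.sql.SparkSession

object PreferredLocationsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("preferred-locations-demo")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // makeRDD lets you attach preferred hosts to each element, which is the
    // same kind of hint NewHadoopRDD derives from InputSplit.getLocations.
    // The host names are purely illustrative.
    val rdd = sc.makeRDD(Seq(
      ("record-a", Seq("worker-1.example.com")),
      ("record-b", Seq("worker-2.example.com"))
    ))

    // The scheduler reads these hints through RDD.preferredLocations and
    // tries to run each task on one of the listed hosts.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} prefers ${rdd.preferredLocations(p)}")
    }

    spark.stop()
  }
}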
2) It's not. It depends on the Hadoop InputFormat of the file.
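For a plain HDFS file that means FileInputFormat: it cuts the file into splits (typically one per HDFS block), and each InputSplit already knows which hosts store its block. You can ask the InputFormat directly, outside of Spark; a small sketch (the path is a placeholder, and the converters import assumes Scala 2.13):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import scala.jdk.CollectionConverters._

object InspectSplits {
  def main(args: Array[String]): Unit = {
    // Ask the InputFormat how a file would be split, and where each split lives.
    val job = Job.getInstance(new Configuration())
    FileInputFormat.addInputPath(job, new Path("hdfs:///data/events.log"))

    val splits = new TextInputFormat().getSplits(job).asScala
    splits.foreach { s =>
      println(s"split of length ${s.getLength} on hosts ${s.getLocations.mkString(", ")}")
    }
  }
}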
3) The same.
4) The mechanism is similar; for example, the KafkaRDD implementation maps Kafka partitions to Spark partitions one-to-one.
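For illustration, a minimal direct-stream job with the spark-streaming-kafka-0-10 integration (broker address, topic and group id are placeholders): the resulting RDDs have exactly as many partitions as the topic, and the LocationStrategy is the locality hint the scheduler uses.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer

object KafkaLocalityDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-locality-demo")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Broker list, topic and group id are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker-1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "locality-demo"
    )

    // Each Kafka partition becomes exactly one Spark partition. The
    // LocationStrategy tells the scheduler where to prefer running tasks;
    // PreferBrokers only makes sense when executors run on the broker hosts.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      println(s"Kafka partitions -> Spark partitions: ${rdd.getNumPartitions}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}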
5) I believe they use the same mechanism.