I have been reading and trying to understand how the Spark framework uses its cores in Standalone mode. According to the Spark documentation, the parameter "spark.task.cpus" defaults to 1, and it sets the number of cores to allocate for each task.
Question 1: For a multi-core machine (e.g., 4 cores in total, 8 hardware threads), when "spark.task.cpus = 4", will Spark use 4 cores (1 thread per core) or 2 cores with hyper-threading?
What will happen if I set "spark.task.cpus = 16", more than the number of available hardware threads on this machine?
Question 2: How is this type of hardware parallelism achieved? I tried to look into the code but couldn't find anything that communicates with the hardware or the JVM for core-level parallelism. For example, if the task is a "filter" function, how is a single filter task split across multiple cores or threads?
Maybe I am missing something. Is this related to the Scala language?
To answer your title question, Spark by itself does not give you parallelism gains within a task. The main purpose of the spark.task.cpus parameter is to allow for tasks of a multithreaded nature. If you call an external multithreaded routine within each task, or you want to encapsulate the finest level of parallelism yourself at the task level, you may want to set spark.task.cpus to more than 1.
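For illustration, here is a minimal sketch of that situation, assuming a hand-built SparkSession and using a Scala parallel collection purely as a stand-in for an external multithreaded routine (the application name and data are made up):

```scala
import org.apache.spark.sql.SparkSession

// Reserve 4 scheduler slots ("cores") per task because each task will
// spin up its own threads internally.
val spark = SparkSession.builder()
  .appName("multithreaded-task-sketch") // hypothetical app name
  .config("spark.task.cpus", "4")
  .getOrCreate()
val sc = spark.sparkContext

val squares = sc.parallelize(1 to 1000, 8).mapPartitions { iter =>
  // Intra-task parallelism is entirely up to the task body; a parallel
  // collection stands in for whatever multithreaded library you would
  // actually call (.par is built in on Scala 2.12; Scala 2.13 needs the
  // scala-parallel-collections module).
  iter.toVector.par.map(x => x * x).seq.iterator
}.collect()
```

Spark itself only reserves the 4 slots per task; it is the code inside mapPartitions that actually puts the extra threads to work.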
Setting this parameter to more than 1 is not something you would do often, though:

- The scheduler will not launch a task unless that many cores are free on the executor, so with 8 available cores and spark.task.cpus set to 3, only 2 tasks will get launched.
- If your task does not keep its extra cores busy the whole time, you may find that setting spark.task.cpus = 1 and experiencing some contention within the task still gives you more performance.
- Overhead such as GC or I/O probably shouldn't be counted in the spark.task.cpus setting, because it'd probably be a much more static cost that doesn't scale linearly with your task count.

Question 1: For a multi-core machine (e.g., 4 cores in total, 8 hardware threads), when "spark.task.cpus = 4", will Spark use 4 cores (1 thread per core) or 2 cores with hyper-threading?
The JVM will almost always rely on the OS to provide it with information and mechanisms for working with CPUs, and AFAIK Spark doesn't do anything special here. If Runtime.getRuntime().availableProcessors() or ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors() returns 4 for your dual-core, HT-enabled Intel® processor, Spark will also see 4 cores.
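You can check what the JVM reports with a couple of lines in the spark-shell (a small sketch; both calls are standard JDK APIs):

```scala
import java.lang.management.ManagementFactory

// On a 2-core machine with Hyper-Threading enabled this typically prints 4,
// because the OS exposes logical processors rather than physical cores.
println(Runtime.getRuntime().availableProcessors())
println(ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors())
```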
Question 2: How is this type of hardware parallelism achieved? I tried to look into the code but couldn't find anything that communicates with the hardware or the JVM for core-level parallelism. For example, if the task is a "filter" function, how is a single filter task split across multiple cores or threads?
As mentioned above, Spark won't automatically parallelize a task according to the spark.task.cpus parameter. Spark is mostly a data-parallelism engine, and its parallelism is achieved mostly by representing your data as RDDs.
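To make the filter example concrete, here is a minimal sketch (assuming the SparkContext sc from a spark-shell; the data and partition count are made up): the filter runs single-threaded inside each task, and the parallelism comes from having one task per partition.

```scala
// 8 partitions -> 8 filter tasks, each scanning its own partition sequentially.
val rdd   = sc.parallelize(1 to 1000000, numSlices = 8)
val evens = rdd.filter(_ % 2 == 0)

// With spark.task.cpus = 1 (the default), up to 8 of these tasks can run
// concurrently, limited by the executor cores available.
println(evens.count())
```

So to get more parallelism out of a filter you would repartition the data into more partitions, not raise spark.task.cpus.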