I have been reading and trying to understand how the Spark framework uses its cores in Standalone mode. According to the Spark documentation, the parameter "spark.task.cpus" defaults to 1, and it sets the number of cores to allocate for each task.
Question 1: For a multi-core machine (e.g., 4 cores in total, 8 hardware threads), when "spark.task.cpus = 4", will Spark use 4 cores (1 thread per core) or 2 cores with hyper-threading?
What will happen if I set "spark.task.cpus = 16", more than the number of available hardware threads on this machine?
Question 2: How is this type of hardware parallelism achieved? I tried to look into the code but couldn't find anything that communicates with the hardware or the JVM for core-level parallelism. For example, if the task is a "filter" function, how is a single filter task split across multiple cores or threads?
Maybe I am missing something. Is this related to the Scala language?
To answer your title question, Spark by itself does not give you parallelism gains within a task. The main purpose of the spark.task.cpus parameter is to allow for tasks of a multithreaded nature. If you call an external multithreaded routine within each task, or you want to encapsulate the finest level of parallelism yourself at the task level, you may want to set spark.task.cpus to more than 1.
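For illustration, here is a minimal sketch of that situation, assuming a hand-built SparkSession and using a Scala parallel collection purely as a stand-in for an external multithreaded routine (the application name and data are made up):

```scala
import org.apache.spark.sql.SparkSession

// Reserve 4 scheduler slots ("cores") per task because each task will
// spin up its own threads internally.
val spark = SparkSession.builder()
  .appName("multithreaded-task-sketch") // hypothetical app name
  .config("spark.task.cpus", "4")
  .getOrCreate()
val sc = spark.sparkContext

val squares = sc.parallelize(1 to 1000, 8).mapPartitions { iter =>
  // Intra-task parallelism is entirely up to the task body; a parallel
  // collection stands in for whatever multithreaded library you would
  // actually call (.par is built in on Scala 2.12; Scala 2.13 needs the
  // scala-parallel-collections module).
  iter.toVector.par.map(x => x * x).seq.iterator
}.collect()
```

Spark itself only reserves the 4 slots per task; it is the code inside mapPartitions that actually puts the extra threads to work.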
Setting this parameter to more than 1 is not something you would do often, though:

- The scheduler will not launch a task unless that many cores are free on the executor, so with 8 available cores and spark.task.cpus set to 3, only 2 tasks will get launched.
- If your task does not keep its extra cores busy the whole time, you may find that setting spark.task.cpus = 1 and experiencing some contention within the task still gives you more performance.
- Overhead such as GC or I/O probably shouldn't be counted in the spark.task.cpus setting, because it'd probably be a much more static cost that doesn't scale linearly with your task count.

Question 1: For a multi-core machine (e.g., 4 cores in total, 8 hardware threads), when "spark.task.cpus = 4", will Spark use 4 cores (1 thread per core) or 2 cores with hyper-threading?
The JVM will almost always rely on the OS to provide it with information and mechanisms for working with CPUs, and AFAIK Spark doesn't do anything special here. If Runtime.getRuntime().availableProcessors() or ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors() returns 4 for your dual-core, HT-enabled Intel® processor, Spark will also see 4 cores.
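You can check what the JVM reports with a couple of lines in the spark-shell (a small sketch; both calls are standard JDK APIs):

```scala
import java.lang.management.ManagementFactory

// On a 2-core machine with Hyper-Threading enabled this typically prints 4,
// because the OS exposes logical processors rather than physical cores.
println(Runtime.getRuntime().availableProcessors())
println(ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors())
```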
Question 2: How is this type of hardware parallelism achieved? I tried to look into the code but couldn't find anything that communicates with the hardware or the JVM for core-level parallelism. For example, if the task is a "filter" function, how is a single filter task split across multiple cores or threads?
As mentioned above, Spark won't automatically parallelize a task according to the spark.task.cpus parameter. Spark is mostly a data-parallelism engine, and its parallelism is achieved mostly by representing your data as RDDs.
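To make the filter example concrete, here is a minimal sketch (assuming the SparkContext sc from a spark-shell; the data and partition count are made up): the filter runs single-threaded inside each task, and the parallelism comes from having one task per partition.

```scala
// 8 partitions -> 8 filter tasks, each scanning its own partition sequentially.
val rdd   = sc.parallelize(1 to 1000000, numSlices = 8)
val evens = rdd.filter(_ % 2 == 0)

// With spark.task.cpus = 1 (the default), up to 8 of these tasks can run
// concurrently, limited by the executor cores available.
println(evens.count())
```

So to get more parallelism out of a filter you would repartition the data into more partitions, not raise spark.task.cpus.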