Tags: apache-spark, delta-lake

Delta Lake 'OPTIMIZE' command does not use all available nodes


Setup

  • AWS EMR 6.10
  • 100 x r5d.4xlarge nodes
  • Spark 3.3.1
  • Delta Lake 2.2.0

Spark-submit conf args (not all of them, only those related to performance):

--num-executors 100  
--executor-memory 64g  
--conf spark.executor.memoryOverhead=54g  
--executor-cores 15  
--conf spark.default.parallelism=1500

Command I execute

OPTIMIZE delta.`<path>` WHERE partition_by_hour_column between <...> and <...>
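For reference, the same compaction can also be expressed through the Delta Scala API (`DeltaTable.optimize()`, available in Delta 2.x) instead of SQL. This is only a sketch: the table path and partition bounds are placeholders, not the actual values from my job.

    import io.delta.tables.DeltaTable

    // Assumes an existing SparkSession named `spark`.
    // Path and partition bounds below are hypothetical placeholders.
    val table = DeltaTable.forPath(spark, "s3://my-bucket/my-table")

    table.optimize()
      .where("partition_by_hour_column BETWEEN '2023-01-01 00:00' AND '2023-01-01 23:00'")
      .executeCompaction()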

Here is the problem: when I check the Spark UI, there are never more than 14-15 jobs running in parallel, despite the fact that I have 100 nodes.

Question: how do I increase the parallelism? Even if it cannot reach 1500, I would like to have at least one job per executor.


Solution

  • There is a configuration setting that allows more threads to be used by the OPTIMIZE command:

        buildConf("optimize.maxThreads")
            .internal()
            .doc(
              """
                |Maximum number of parallel jobs allowed in OPTIMIZE command. Increasing the maximum
                | parallel jobs allows the OPTIMIZE command to run faster, but increases the job
                | management on the Spark driver side.
                |""".stripMargin)
            .intConf
            .checkValue(_ > 0, "'optimize.maxThreads' must be positive.")
            .createWithDefault(15)
    

    So in my case it should be an additional option passed to Spark (a runtime alternative is sketched below):

    --conf spark.databricks.delta.optimize.maxThreads=100
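    Since `spark.databricks.delta.optimize.maxThreads` is a Spark SQL conf, it should also be possible to set it at runtime on the SparkSession right before running OPTIMIZE, instead of changing the spark-submit command. A minimal sketch, assuming an existing SparkSession named `spark`; the table path and partition bounds are placeholders:

        // Raise the OPTIMIZE job pool from the default of 15 to 100
        spark.conf.set("spark.databricks.delta.optimize.maxThreads", "100")

        // Path and partition bounds are hypothetical placeholders
        spark.sql(
          """OPTIMIZE delta.`s3://my-bucket/my-table`
            |WHERE partition_by_hour_column BETWEEN '2023-01-01 00:00' AND '2023-01-01 23:00'
            |""".stripMargin)

    Note that the config only controls how many OPTIMIZE file-compaction jobs the driver schedules in parallel; raising it increases bookkeeping on the driver, as the doc string above warns.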