
Parallel Python with joblibspark: how to evenly distribute jobs?


I have a project in which joblib works well on a single computer: it sends functions out to different cores effectively.

Now I need to do the same thing on a Databricks cluster. I've tried many approaches today, but the problem is that the jobs do not spread out one per compute node. I have 4 executors and set n_jobs=6, but when I send 4 jobs through, some of them pile up on the same node, leaving other nodes unused. Here's a picture of the Databricks Spark UI: (screenshot of the Spark UI showing some executors left idle).

Sometimes when I try this, I get 1 job running on a node by itself and all of the rest piled up on one other node.

In the joblib and joblibspark docs, I see the parameter batch_size, which specifies how many tasks are sent to a given node. Even when I set it to 1, I get the same problem: nodes left unused.

    from joblib import Parallel, delayed
    from joblibspark import register_spark

    register_spark()

    output = Parallel(backend="spark", n_jobs=6,
                      verbose=config.JOBLIB_VERBOSE, batch_size=1)(
        delayed(fit_one)(x, model_data=model_data, dlmodel=dlmodel,
                         outdir=outdir, frac=sample_p,
                         score_type=score_type,
                         save=save,
                         verbose=verbose)
        for x in ZZ)

I've hacked at this all day, trying various backends and combinations of settings. What am I missing?


Solution

  • Updates to Spark and joblib-spark released in mid-2022 resolved this issue.

    https://github.com/joblib/joblib-spark/issues/34
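
For reference, a minimal sketch of the setup once you are on versions that include the fix (assuming a Databricks notebook; fit_one and inputs here are stand-ins for your own function and iterable, not the asker's originals):

    # In a Databricks notebook cell, upgrade to releases that include the fix
    # (joblib-spark versions published after mid-2022):
    # %pip install --upgrade joblib joblibspark

    from joblib import Parallel, delayed
    from joblibspark import register_spark

    register_spark()  # makes the "spark" backend available to joblib

    def fit_one(x):
        # stand-in for the real per-task work
        return x * x

    inputs = range(8)

    # batch_size=1 dispatches one task per Spark task, so the scheduler
    # can spread work across all executors instead of piling it on one.
    results = Parallel(backend="spark", n_jobs=6, verbose=10, batch_size=1)(
        delayed(fit_one)(x) for x in inputs)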