Tags: java, apache-spark, spark-submit

Is it possible (and wise) to execute another "spark-submit" inside a JavaRDD?


I'm trying to execute a Spark program with spark-submit (in particular a GATK Spark tool, so the command is not spark-submit itself, but something similar): this program accepts only a single input, so I'm trying to write some Java code that lets me handle multiple inputs.

In particular, I'm trying to execute a spark-submit for each input through the pipe function of JavaRDD:

JavaRDD<String> bashExec = ubams.map(ubam -> par1 + "|" + par2)
            .pipe("/path/script.sh");

where par1 and par2 are parameters passed to script.sh, which splits them on "|" and uses them to execute something similar to spark-submit.

Now, I don't expect a speedup compared to running a single input, because I'm calling other Spark jobs; I just want to distribute the workload of multiple inputs across different nodes and get an execution time that scales linearly with the number of inputs.

For example, the GATK Spark tool took about 108 minutes with a single input, so with my code I would expect two similar inputs to take roughly 216 minutes.

I noticed that the code "works", or rather that I obtain the usual output on my terminal. But after at least 15 hours, the task still hadn't completed and was still running.

So I'm asking whether this approach (executing spark-submit through the pipe function) is fundamentally flawed, or whether there are other errors.

I hope I have explained my issue clearly.

P.S. I'm using a VM on Azure with 28 GB of memory and 4 execution threads.


Solution

  • Is it possible

    Yes, it is technically possible. With a bit of caution it is even possible to create a new SparkContext in the worker thread, but

    Is it (...) wise

    No. You should never do something like this. There is a good reason why Spark disallows nested parallelization in the first place. Anything that happens inside a task is a black box, so it cannot be accounted for during DAG computation and resource allocation. In the worst-case scenario the job will simply deadlock, with the main job waiting for the tasks to finish and the tasks waiting for the main job to release the required resources.

    How to solve this? The problem is only roughly outlined, so it is hard to give you precise advice, but you can:

    • Use a driver-local loop to submit multiple jobs sequentially from a single application (see the first sketch after this list).
    • Use threading and in-application scheduling to submit multiple jobs concurrently from a single application (see the second sketch after this list).
    • Use an independent orchestration tool to submit multiple independent applications, each handling one set of parameters.
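
    A minimal sketch of the first option, a driver-local loop: a single Spark application iterates over the inputs in the driver and runs one ordinary job per input. The input paths and the count() action are placeholders standing in for whatever per-input processing your real pipeline performs.

        import java.util.Arrays;
        import java.util.List;
        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaSparkContext;

        public class SequentialJobs {
            public static void main(String[] args) {
                SparkConf conf = new SparkConf().setAppName("sequential-jobs");
                JavaSparkContext sc = new JavaSparkContext(conf);

                // Hypothetical input paths; in the question these would be the ubam inputs.
                List<String> inputs = Arrays.asList("/data/input1.ubam", "/data/input2.ubam");

                for (String input : inputs) {
                    // Each iteration submits a normal Spark job through the shared context.
                    // count() is only a placeholder action for the real per-input work.
                    long records = sc.textFile(input).count();
                    System.out.println(input + " -> " + records + " records");
                }

                sc.stop();
            }
        }

    And a sketch of the second option, threading plus in-application scheduling: the jobs are submitted concurrently from separate driver threads, and the FAIR scheduler lets them share the executors instead of running strictly one after another. Again the paths and the count() action are only illustrative.

        import java.util.Arrays;
        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.TimeUnit;
        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaSparkContext;

        public class ConcurrentJobs {
            public static void main(String[] args) throws InterruptedException {
                SparkConf conf = new SparkConf()
                        .setAppName("concurrent-jobs")
                        // FAIR scheduling lets jobs submitted from different threads share executors.
                        .set("spark.scheduler.mode", "FAIR");
                JavaSparkContext sc = new JavaSparkContext(conf);

                // Hypothetical input paths, one Spark job per input.
                List<String> inputs = Arrays.asList("/data/input1.ubam", "/data/input2.ubam");

                ExecutorService pool = Executors.newFixedThreadPool(inputs.size());
                for (String input : inputs) {
                    pool.submit(() -> {
                        // Each thread submits an ordinary job through the shared SparkContext.
                        long records = sc.textFile(input).count();
                        System.out.println(input + " -> " + records + " records");
                    });
                }
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.DAYS);
                sc.stop();
            }
        }

    Both sketches assume the per-input processing can run inside this one application; if the GATK tool can only be launched as its own spark-submit-like command, the third option (an external orchestration tool, or even a simple shell loop on the driver machine) is the more realistic route.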