I'm trying to execute a Spark program with spark-submit
(in particular GATK Spark tools, so the command is not spark-submit
, but something similar): this program accept an only input, so I'm trying to write some Java code in order to accept more inputs.
In particular I'm trying to execute a spark-submit
for each input, through the pipe function of JavaRDD:
JavaRDD<String> bashExec = ubams.map(ubam -> par1 + "|" + par2)
.pipe("/path/script.sh");
where par1
and par2
are parameters that will be passed to the script.sh, which will handle (splitting by "|" ) and use them to execute something similar to spark-submit
.
Now, I don't expect to obtain speedup compared to the execution of a single input because I'm calling other Spark functions, but just to distribute the workload of more inputs on different nodes and have linear execution time to the number of inputs.
For example, the GATK Spark tool lasted about 108 minutes with an only input, with my code I would expect that with two similar inputs it would last something similar to about 216 minutes.
I noticed that that the code "works", or rather I obtain the usual output on my terminal. But in at least 15 hours, the task wasn't completed and it was still executing.
So I'm asking if this approach (executing spark-submit
with the pipe function) is stupid or probably there are other errors?
I hope to be clear in explaining my issue.
P.S. I'm using a VM on Azure with 28GB of Memory and 4 execution threads.
Is it possible
Yes, it is technically possible. With a bit caution it is even possible to create a new SparkContext
in the worker thread, but
Is it (...) wise
No. You should never do something like this. There is a good reason for Spark disallowing nested parallelization in the first place. Anything that happens inside a task is a black-box, therefore it cannot be accounted during DAG computation and resources allocation. In the worst case scenario job will just deadlock with the main job waiting for the tasks to finish, and tasks waiting for the main job to release required resource.
How to solve this. The problem is rather roughly outlined so it hard to give you a precise advice but you can: