I have multiple Python scripts that do not use much memory, and they run on PySpark.
My Spark master (standalone) has 4 CPUs (cores) and 16 GB of memory, so Spark runs only 4 scripts at a time (1 script : 1 core).
But I want Spark to run 20-30 scripts at a time. How can I do that?
I have attached images of my Spark master web UI. Please help me.
Given that you only have 4 cores, it's unlikely you'll be able to run that many tasks at the same time, nor should you want to. Parallelism on your Spark instance is likely limited by any of the following:

- spark.default.parallelism
- spark.executor.cores
- spark.sql.shuffle.partitions
- the actual data partitioning

The rule of thumb is that a worker (here, a CPU core) works on one partition at a time, so if your script processes a DataFrame with 4 partitions, it will likely take up all the cores available to it on your machine.

Based on the above, try setting spark.default.parallelism to your desired number of cores, set the Spark scheduler mode to FAIR, and consider wrapping your "scripts" in functions that you submit to a ThreadPoolExecutor (or a similar implementation, if working from PySpark) which in turn submits the jobs to Spark. This way, Spark will attempt to schedule as many jobs concurrently as possible. However, this doesn't mean 20-30 "scripts" will be processed at the same time (that likely cannot be achieved); it only means that the jobs will be submitted in parallel and then scheduled and processed in a FAIR fashion.
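The submission pattern above can be sketched roughly like this. Note this is a minimal sketch, not a drop-in solution: `run_script` is a hypothetical stand-in for one of your scripts (a real one would run DataFrame operations against a shared SparkSession), and the commented-out builder shows the assumed FAIR-scheduler configuration, so the snippet runs even without pyspark installed.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed Spark setup (uncomment if pyspark is installed):
# from pyspark.sql import SparkSession
# spark = (SparkSession.builder
#          .master("spark://your-master:7077")      # hypothetical master URL
#          .config("spark.scheduler.mode", "FAIR")  # interleave concurrent jobs
#          .config("spark.default.parallelism", "4")
#          .getOrCreate())

def run_script(n):
    # Hypothetical stand-in for one of your "scripts". In a real job this
    # would trigger Spark actions on the shared SparkSession, e.g.:
    #   return spark.read.parquet(f"/data/input_{n}").count()
    return n * n  # placeholder work so the sketch is runnable as-is

# Submit all "scripts" from driver-side threads at once; with FAIR
# scheduling, Spark interleaves the resulting jobs across the 4 cores
# instead of running them strictly one after another.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(run_script, range(20)))
```

All threads here share one SparkSession on the driver; the 20 workers only control how many jobs are *submitted* concurrently, while actual task execution is still bounded by your 4 cores.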