Tags: multithreading, apache-spark, pyspark, multiprocessing, rdd

How to execute multiple scripts on Spark?


I have multiple Python scripts that do not use a lot of memory, and these scripts run on PySpark.

My Spark master (standalone) has 4 CPU cores and 16 GB of memory, so Spark runs only 4 scripts at a time (1 script : 1 core).

But I want Spark to run 20~30 scripts at a time. How can I do this?

These are my Spark master web UI screenshots. Please help me.



Solution

  • Given you only have 4 cores, it's unlikely you'll be able to run that many tasks at the same time, nor should you want to. Parallelism on your Spark instance is likely limited by any of the following:

    1. Hardware (i.e. how many physical cores your machine has)
    2. The configuration of your Spark parallelism and core usage, i.e. Spark config settings such as spark.default.parallelism and spark.executor.cores (see the inspection sketch after this list)
    3. The configuration of your Spark scheduler (FIFO/FAIR) - if this is set to FIFO, your instance will attempt to process one "script" at a time (while still working in parallel within it, as described in point 5)
    4. How you are submitting your "scripts" - if you submit them from one process and one thread and always collect the results back to Python, they will run in series, because the collection back to Python (the driver) is likely blocking
    5. The partitioning of your data (spark.sql.shuffle.partitions and the actual data partitioning) - the rule of thumb is that a worker (here a CPU core) works on one partition at a time, so if your script processes a dataframe with 4 partitions, it may well take up all the cores available to it on your machine.
    6. The actual scripts and actions in them.
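
    As an illustration (not part of the original answer), points 2, 3 and 5 can be checked from an existing PySpark session like this; the values printed are simply whatever your cluster is currently configured with:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Point 2: parallelism and core configuration
    print("default parallelism:", sc.defaultParallelism)
    print("executor cores:     ", sc.getConf().get("spark.executor.cores", "not set"))

    # Point 3: scheduler mode (FIFO unless configured otherwise)
    print("scheduler mode:     ", sc.getConf().get("spark.scheduler.mode", "FIFO"))

    # Point 5: shuffle partitions and the partitioning of an actual dataframe
    print("shuffle partitions: ", spark.conf.get("spark.sql.shuffle.partitions"))
    df = spark.range(1000)  # stand-in dataframe just for illustration
    print("df partitions:      ", df.rdd.getNumPartitions())
    ```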

    Based on the request above, try setting the default parallelism to your desired number of cores, set the Spark scheduler to FAIR, and consider wrapping your "scripts" in functions that you submit to a ThreadPoolExecutor or a similar implementation (if working from PySpark), which in turn submits them to Spark, as sketched below. This way, Spark will attempt to schedule as many jobs at the same time as possible. However, this does not mean that 20-30 "scripts" will be processed at the same time (that likely cannot be achieved); it only means that the jobs will be submitted in parallel and scheduled and processed in a fair fashion.
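
    A minimal sketch of that approach, assuming a single shared SparkSession; the process_script function and the input paths are hypothetical placeholders, not something from the question:

    ```python
    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("parallel-scripts")
        .config("spark.scheduler.mode", "FAIR")    # schedule concurrent jobs fairly instead of FIFO
        .config("spark.default.parallelism", "4")  # match the number of available cores
        .getOrCreate()
    )

    def process_script(path):
        # Each former "script" becomes a function that runs its own Spark job.
        df = spark.read.csv(path, header=True)
        return df.count()  # the action triggers the job

    # Hypothetical inputs, one per "script"
    paths = ["/data/input_{}.csv".format(i) for i in range(20)]

    # Submitting from several threads lets Spark interleave the jobs,
    # but only 4 tasks can physically run at once on 4 cores.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(process_script, paths))

    print(results)
    ```

    Note that the thread pool only parallelises job submission and the blocking actions on the driver; actual task throughput is still bounded by the 4 cores.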