Tags: scala, apache-spark, pyspark

How to get the number of workers (executors) in PySpark?


I need to use this value as a parameter, so how can I get the number of workers? In Scala, I can call sc.getExecutorMemoryStatus to get the available number of workers, but in PySpark there seems to be no API exposed to get this number.


Solution

  • In Scala, getExecutorStorageStatus and getExecutorMemoryStatus both return the number of executors including the driver, as in the example snippet below:

    /** Method that just returns the current active/registered executors
      * excluding the driver.
      * @param sc The spark context to retrieve registered executors.
      * @return a list of executors each in the form of host:port.
      */
    def currentActiveExecutors(sc: SparkContext): Seq[String] = {
      val allExecutors = sc.getExecutorMemoryStatus.map(_._1)
      val driverHost: String = sc.getConf.get("spark.driver.host")
      allExecutors.filter(!_.split(":")(0).equals(driverHost)).toList
    }
    

    But in the Python API this is not implemented.

    @DanielDarabos's answer also confirms this.

    The closest equivalent to this in Python reads the configured executor count:

    sc.getConf().get("spark.executor.instances")
    
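    Note that spark.executor.instances is only present when it has been set explicitly (it is typically absent in local mode or with dynamic allocation). A minimal sketch, assuming you want a placeholder fallback in that case (the "1" below is only illustrative):

    %python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Configured executor count; falls back to an illustrative default when the
    # property is not set (e.g. local mode or dynamic allocation).
    configured_executors = int(sc.getConf().get("spark.executor.instances", "1"))
    print(configured_executors)
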

    Edit (Python):

    %python
    sc = spark._jsc.sc()  # underlying Java SparkContext, accessed via py4j
    # Count executor hosts reported by the status tracker, then subtract 1 for the driver.
    n_workers = len([executor.host() for executor in sc.statusTracker().getExecutorInfos()]) - 1

    print(n_workers)
    

    As Danny mentioned in the comments, if you want to cross-verify them you can use the statements below.

    %python
    sc = spark._jsc.sc()

    # All executors plus the driver, keyed by host:port
    result1 = sc.getExecutorMemoryStatus().keys()

    # Executor count from the status tracker, minus 1 for the driver
    result2 = len([executor.host() for executor in sc.statusTracker().getExecutorInfos()]) - 1

    print(result1, end='\n')
    print(result2)
    

    Example result:

    Set(10.172.249.9:46467)
    0
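
    If you need this value in several places, here is a minimal sketch that packages the same status-tracker approach into a hypothetical helper, n_workers (it assumes, as the snippets above do, that subtracting one for the driver is appropriate for your deployment):

    %python
    from pyspark.sql import SparkSession

    def n_workers(spark: SparkSession) -> int:
        """Number of registered executors, excluding the driver (hypothetical helper)."""
        jsc = spark._jsc.sc()  # underlying Java SparkContext, accessed via py4j
        executor_infos = jsc.statusTracker().getExecutorInfos()
        return len([e.host() for e in executor_infos]) - 1

    spark = SparkSession.builder.getOrCreate()
    print(n_workers(spark))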