python, apache-spark, pyspark, jupyter-notebook

Is there a way to access PySpark's executors and send jobs to them manually via a Jupyter or Zeppelin notebook?


Probably just a pipe dream, but is there a way to access PySpark's executors and send jobs to them manually from a Jupyter or Zeppelin notebook?

It probably goes against all PySpark convention as well, but the idea is to access a running EMR cluster's executors (workers) and send them Python scripts to run. Sort of like Python's multiprocessing, where the pool is the executors themselves, and you just feed it a function or a list of Python script paths and arguments to map over. Something like this:

# Hypothetical API -- sc.getExecutors() does not exist in PySpark;
# this is the kind of interface I am wishing for.
pyspark_executors = sc.getExecutors()

def square(number):
    return number ** 2

numbers = [1, 2, 3, 4, 5]
with pyspark_executors.Pool() as pool:
    results = pool.map(square, numbers)
print(results)
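
For comparison, the closest built-in route I know of is to parallelize the inputs into an RDD and map the function over it. The work does run on the executors, but it goes through Spark's scheduler rather than addressing specific workers, so this is only a rough sketch of the behavior I am after:

# Closest standard equivalent: distribute the inputs as an RDD and
# let Spark decide which executors run which partitions; you cannot
# target an individual worker.
numbers = [1, 2, 3, 4, 5]

def square(number):
    return number ** 2

results = sc.parallelize(numbers).map(square).collect()
print(results)  # [1, 4, 9, 16, 25]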

Solution

  • There is a workaround that accomplishes roughly what I was looking for, and that is using a UDF (User Defined Function). Inside the UDF you can call boto3 (or whatever else) and do work for every row. For example, using withColumn:

    pathList = ["a","b"]
    df = spark.createDataFrame(pathList, StringType())
    
    def myFunction(a):
        {insert boto3 code or whatever code here, but you have to create your clients here}
        return "hello world"
    
    myUDF= udf(myFunction)
    df = df.withColumn("result", myUDF("value"))
    df.show(truncate=False)
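
    As a concrete illustration of the placeholder above, here is a minimal sketch of a UDF body that calls boto3 per row. The bucket name "my-bucket" and the pathExists/head_object check are hypothetical stand-ins for whatever per-row work is needed; the point is that the client is created inside the function, so it is constructed on each executor rather than pickled from the driver.

    import boto3
    import botocore
    from pyspark.sql.functions import udf

    def pathExists(key):
        # Create the client inside the function so it is built on the
        # executor; boto3 clients cannot be serialized from the driver.
        s3 = boto3.client("s3")
        try:
            s3.head_object(Bucket="my-bucket", Key=key)  # hypothetical bucket
            return "exists"
        except botocore.exceptions.ClientError:
            return "missing"

    existsUDF = udf(pathExists)
    df = df.withColumn("exists", existsUDF("value"))
    df.show(truncate=False)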