python, pyspark, azure-data-factory, databricks

CONTEXT_ONLY_VALID_ON_DRIVER: how to access/pass the Spark context to a pandas_udf in another Python file


I have a Python file (main.py) executing in ADF, and another file (udf.py) for the pandas UDFs. I import the udf.py functions into main.py like below:

from udf import func1, func2

When this executes in ADF, we get an error saying: CONTEXT_ONLY_VALID_ON_DRIVER

So, how can I pass the Spark context, or how else can I solve this issue?

Everything works fine when it is kept in the same file as a plain script rather than modularized. I suspect it has something to do with the worker and driver nodes, but I need some light on this.
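
For reference, here is a minimal, hypothetical sketch of the kind of split that can trigger this error; the function body below is an assumption for illustration, not the actual code:

# udf.py -- hypothetical illustration of the failing pattern
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("long")
def func1(s: pd.Series) -> pd.Series:
    # The UDF body runs on the executors; referencing the driver-side
    # SparkContext here forces it to be shipped to the workers,
    # which raises CONTEXT_ONLY_VALID_ON_DRIVER
    n = spark.sparkContext.defaultParallelism
    return s + n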


Solution

  • You are referencing the Spark context inside a broadcast variable in your UDF function. Instead of referencing or passing the Spark object to your UDF, create a Spark session in udf.py itself and use that.

    Here is the sample code I tried in udf.py:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    # Create (or reuse) the session inside udf.py instead of passing it in
    spark = SparkSession.builder.getOrCreate()

    def addone(data, schema):
        # The UDF body only touches plain Python values, so nothing
        # driver-side gets serialized to the workers
        udfAddOne = udf(lambda x: x + 1, IntegerType())

        df = spark.createDataFrame(data, schema=schema).withColumn("number_plus_one", udfAddOne(col("id")))
        return df
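
    For completeness, a hedged sketch of how main.py might call this after the change, assuming the addone(data, schema) signature above; the sample data and schema are made up for illustration:

    # main.py -- hypothetical usage of the reworked udf.py
    from udf import addone

    data = [(1,), (2,), (3,)]
    schema = "id int"

    addone(data, schema).show()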