Tags: python, pyspark, azure-synapse, spark-notebook

Calling referenced functions after mssparkutils.notebook.run?


How can I call functions defined in a different Synapse notebook after running the notebook with mssparkutils.notebook.run()?

example:

#parameters
value = "test"
from notebookutils import mssparkutils

mssparkutils.notebook.run("function definitions", 60, {"param": value})
df = load_cosmos_data() #defined in 'function definitions' notebook

This fails with: NameError: name 'load_cosmos_data' is not defined

I can use the functions with the %run command, but I need to be able to pass the parameter through to the function definitions notebook. %run doesn't allow me to pass a variable as a parameter.
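
For reference, here is a rough sketch of what the %run form looks like (the notebook path is illustrative, and the %run statement typically has to sit in its own cell). As noted above, it only accepts literal parameter values, so the value variable can't be substituted into the call:

%run /function_definitions { "param": "test" }

df = load_cosmos_data()  # functions from the referenced notebook are available in the current session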


Solution

  • After going through this official Microsoft documentation:

    When one notebook references another with mssparkutils.notebook.run(), the referenced notebook runs to completion (whether or not it calls exit()), and then the source notebook continues. The two remain separate notebooks with no shared state: the source notebook cannot access variables defined in the referenced notebook, and the same applies to its functions.

    This mirrors general programming languages: a function's local variables are not accessible after it returns, unless the function explicitly returns them.

    Unfortunately, the exit() method doesn’t support returning values other than strings from the referenced notebook.
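
    If the values you need back are small and structured (rather than a full dataframe), one workaround, sketched here as an assumption and not part of the original demonstration, is to serialize them to a JSON string before calling exit() and parse the string that notebook.run() returns:

    # referenced notebook: exit() only accepts a string, so serialize small results
    import json
    summary = {"row_count": 42, "status": "ok"}   # hypothetical values
    mssparkutils.notebook.exit(json.dumps(summary))

    # source notebook: notebook.run() hands that string back, so parse it
    returned = mssparkutils.notebook.run("/function_notebook", 60, {"param": value})
    summary = json.loads(returned)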

    In your code, you need to access the dataframe returned by load_cosmos_data() in the referenced notebook. You can do that using a temporary view.

    Please follow the demonstration below:

    In the referenced notebook, call the function, store the returned dataframe in a variable, and create a temporary view from it, passing the view name back via exit(). In the source notebook, you can then read that temporary view back into a dataframe.

    Function Notebook:
    Code:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    from notebookutils import mssparkutils

    # 'param' (passed from the source notebook) is available here when this
    # notebook defines it in its parameters cell
    def load_data():
        # sample data standing in for the real Cosmos DB load
        data2 = [(24, "Rakesh", "Govindula"),
                 (16, "Virat", "Kohli")]
        schema = StructType([
            StructField("id", IntegerType(), True),
            StructField("firstname", StringType(), True),
            StructField("lastname", StringType(), True)
        ])
        df = spark.createDataFrame(data=data2, schema=schema)
        return df

    # call the function, expose the result as a temp view,
    # and hand the view name back to the source notebook
    df2 = load_data()
    df2.show()
    df2.createOrReplaceTempView("dataframeview")
    mssparkutils.notebook.exit("dataframeview")
    


    Source Notebook:
    Code:

    value = "test"
    from notebookutils import mssparkutils

    # run the function notebook, passing the parameter; exit() in that notebook
    # returns the temp view name as a string
    view_name = mssparkutils.notebook.run("/function_notebook", 60, {"param": value})
    # read the temporary view back into a dataframe in this notebook
    df = spark.sql("select * from {0}".format(view_name))
    df.show()
    


    With this approach you can pass the parameter through to the function notebook and also access the dataframe returned by the function.
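
    Optionally, once the dataframe has been read in the source notebook, you can drop the temporary view so it doesn't linger in the session catalog. A small cleanup sketch (the cache()/count() calls are just one way to keep the data available after the view is gone):

    df = spark.sql("select * from {0}".format(view_name))
    df.cache()                               # keep the data around after the view is dropped
    df.count()                               # materialize the cache
    spark.catalog.dropTempView(view_name)    # remove the temp view from the Spark session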

    Please go through this SO thread if you face any issues when returning values from a Synapse notebook.