Tags: pyspark, databricks, azure-databricks

Databricks notebooks + Repos spark session scoping breakdown


I'm using Databricks, and I have a repo containing a basic Python module in which I define a class. I'm able to import the class and access its methods from a Databricks notebook.

One of the methods in that class looks like this (simplified):

    def read_raw_json(self):
        self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")

When I execute this particular method from the Databricks notebook, it raises a NameError saying 'spark' is not defined. The Databricks runtime starts up with a SparkSession stored in a variable called spark, so I assumed any method executed in that runtime would inherit it from the parent scope.

Anyone know why this isn't the case?
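For context, the notebook side looks roughly like this (the module and class names here are placeholders, not my real ones):

    # notebook cell -- the import itself succeeds
    from my_module import RawResource   # placeholder module/class names

    obj = RawResource("/mnt/raw", "orders")   # placeholder path and resource
    obj.read_raw_json()                       # NameError: name 'spark' is not defined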

EDIT: I was able to get it to work by passing the notebook's spark variable into my class at instantiation (see the sketch below). But I don't want to call this an answer yet, because I'm not sure why I needed to.
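Roughly, the workaround looks like this (again, the names are placeholders):

    # module file: accept the session as a constructor argument
    class RawResource:
        def __init__(self, spark, base_savepath, resource):
            self.spark = spark              # keep a reference to the session passed in
            self.base_savepath = base_savepath
            self.resource = resource
            self.df = None

        def read_raw_json(self):
            self.df = self.spark.read.format("json").load(
                f"{self.base_savepath}/{self.resource}/{self.resource}*.json"
            )

    # notebook cell: 'spark' is the session the Databricks runtime already created
    obj = RawResource(spark, "/mnt/raw", "orders")
    obj.read_raw_json()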


Solution

  • Python files (not notebooks) don't have spark pre-defined in them. When you call read_raw_json, Python looks up the name spark in the module's own global scope, not the notebook's, finds nothing, and raises a NameError.

    You can modify the Python file like this and everything will work fine (a short usage example follows the snippet):

    # at the top of the module
    from pyspark.sql import SparkSession

    # inside the class, same method as before
    def read_raw_json(self):
        # getOrCreate() returns the session the Databricks runtime already started
        spark = SparkSession.builder.getOrCreate()
        self.df = spark.read.format("json").load(
            f"{self.base_savepath}/{self.resource}/{self.resource}*.json"
        )
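With that change the notebook side needs no special handling. A minimal usage sketch (placeholder names again):

    # notebook cell
    from my_module import RawResource   # placeholder module/class names

    obj = RawResource("/mnt/raw", "orders")
    obj.read_raw_json()
    display(obj.df)   # display() is available in Databricks notebooks

Note that getOrCreate() doesn't spin up a second session on Databricks; it returns the one the runtime already started, so calling it inside the method is cheap. If you prefer, you can call it once in __init__ and store the session on self instead.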