Tags: pyspark, databricks, azure-databricks

Databricks notebooks + Repos spark session scoping breakdown


I'm using Databricks, and I have a repo containing a basic Python module in which I define a class. I'm able to import the class and access its methods from a Databricks notebook.

One of the methods in that class looks like this (simplified):

    def read_raw_json(self):
        self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")

When I execute this particular method from the Databricks notebook, it raises a NameError saying 'spark' is not defined. The Databricks runtime starts up with a SparkSession stored in a variable called spark, so I assumed any method executed in that runtime would inherit it from the parent scope.

Anyone know why this isn't the case?
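For context, the notebook side looks roughly like this (the module and class names here are placeholders, not my real ones):

    # notebook cell -- the import itself succeeds
    from my_module import RawResource   # placeholder module/class names

    obj = RawResource("/mnt/raw", "orders")   # placeholder path and resource
    obj.read_raw_json()                       # NameError: name 'spark' is not defined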

EDIT: I was able to get it to work by passing the notebook's spark variable into my class at instantiation (see the sketch below). But I don't want to call this an answer yet, because I'm not sure why I needed to.
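Roughly, the workaround looks like this (again, the names are placeholders):

    # module file: accept the session as a constructor argument
    class RawResource:
        def __init__(self, spark, base_savepath, resource):
            self.spark = spark              # keep a reference to the session passed in
            self.base_savepath = base_savepath
            self.resource = resource
            self.df = None

        def read_raw_json(self):
            self.df = self.spark.read.format("json").load(
                f"{self.base_savepath}/{self.resource}/{self.resource}*.json"
            )

    # notebook cell: 'spark' is the session the Databricks runtime already created
    obj = RawResource(spark, "/mnt/raw", "orders")
    obj.read_raw_json()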


Solution

  • Python files (not notebooks) don't have spark pre-defined in them. When you call read_raw_json, Python looks up the name spark in the module's own global scope, not the notebook's, finds nothing, and raises a NameError.

    You can modify the Python file like this and everything will work fine (a short usage example follows the snippet):

    # at the top of the module
    from pyspark.sql import SparkSession

    # inside the class, same method as before
    def read_raw_json(self):
        # getOrCreate() returns the session the Databricks runtime already started
        spark = SparkSession.builder.getOrCreate()
        self.df = spark.read.format("json").load(
            f"{self.base_savepath}/{self.resource}/{self.resource}*.json"
        )
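With that change the notebook side needs no special handling. A minimal usage sketch (placeholder names again):

    # notebook cell
    from my_module import RawResource   # placeholder module/class names

    obj = RawResource("/mnt/raw", "orders")
    obj.read_raw_json()
    display(obj.df)   # display() is available in Databricks notebooks

Note that getOrCreate() doesn't spin up a second session on Databricks; it returns the one the runtime already started, so calling it inside the method is cheap. If you prefer, you can call it once in __init__ and store the session on self instead.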