I'm using databricks, and I have a repo in which I have a basic python module within which I define a class. I'm able to import and access the class and its methods from the databricks notebook.
One of the methods in that class looks like this (simplified):
```python
def read_raw_json(self):
    self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
```
When I execute this particular method from the Databricks notebook it raises a NameError: name 'spark' is not defined. The Databricks runtime instantiates a Spark session and stores it in a variable called `spark`. I assumed any methods executed in that runtime would inherit from the parent scope.
Anyone know why this isn't the case?
EDIT: I was able to get it to work by passing the notebook's `spark` variable into my class at instantiation. But I don't want to call this an answer yet, because I'm not sure why I needed to.
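For reference, the workaround described in the edit can be sketched like this (the class name `RawReader` and constructor signature are hypothetical; the attribute names mirror the question):

```python
class RawReader:
    def __init__(self, spark, base_savepath, resource):
        self.spark = spark              # keep a reference to the injected session
        self.base_savepath = base_savepath
        self.resource = resource

    def read_raw_json(self):
        # use the injected session instead of relying on a global `spark`
        self.df = self.spark.read.format("json").load(
            f"{self.base_savepath}/{self.resource}/{self.resource}*.json"
        )
```

In the notebook you would then write something like `reader = RawReader(spark, base_savepath, resource)`, handing the notebook's `spark` in explicitly.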
Python files (as opposed to notebooks) don't have a `spark` variable initialized in their global scope. A function resolves free names in the globals of the module where it was *defined*, not where it is *called*. So when `read_raw_json` runs, Python looks up `spark` in your module's globals, doesn't find it, and raises a NameError.
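This is standard Python scoping, not something Databricks-specific. A minimal pure-Python demonstration (no Spark involved):

```python
import types

# Simulate importing a module that references a name `spark` it never defines.
mod = types.ModuleType("mymodule")
exec("def read_raw_json():\n    return spark", mod.__dict__)

spark = "notebook session"   # defined in the *calling* scope, like the notebook

try:
    mod.read_raw_json()      # resolves `spark` in mymodule's globals, not ours
except NameError as exc:
    print(exc)               # name 'spark' is not defined
```

Note that defining (or importing) the function raises nothing; the NameError only appears when the function body actually executes.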
You can modify the Python file like this and everything will work fine:
```python
from pyspark.sql import SparkSession

def read_raw_json(self):
    # getOrCreate() returns the session Databricks has already created
    spark = SparkSession.builder.getOrCreate()
    self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
```