I have my Databricks Python code in GitHub. I set up a basic workflow to lint the Python code using flake8. This fails because the names that are implicitly available to my script when it runs on Databricks (like spark, sc, dbutils, getArgument, etc.) are not available when flake8 lints it outside Databricks (on the GitHub Ubuntu VM).
How can I lint Databricks notebooks in GitHub using flake8?
E.g. errors I get:
test.py:1:1: F821 undefined name 'dbutils'
test.py:3:11: F821 undefined name 'getArgument'
test.py:5:1: F821 undefined name 'dbutils'
test.py:7:11: F821 undefined name 'spark'
My notebook in GitHub:
dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")

jdbcurl = getArgument("my_jdbcurl")

dbutils.fs.ls(".")

df_node = spark.read.format("jdbc")\
    .option("driver", "org.mariadb.jdbc.Driver")\
    .option("url", jdbcurl)\
    .option("dbtable", "my_table")\
    .option("user", "my_username")\
    .option("password", "my_pswd")\
    .load()
My .github/workflows/lint.yml:
on:
  pull_request:
    branches: [ master ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v1
        with:
          python-version: 3.8
      - run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Lint with flake8
        run: |
          pip install flake8
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
TL;DR
Don't use the built-in variable dbutils in code that needs to run both locally (IDE, unit tests, ...) and on Databricks (production). Create your own instance of the DBUtils class instead.
Here is what we ended up doing:
Created a new dbk_utils.py:
from pyspark.sql import SparkSession


def get_dbutils(spark: SparkSession):
    try:
        # pyspark.dbutils is available with Databricks Connect and on
        # Databricks clusters, so DBUtils can be built directly.
        from pyspark.dbutils import DBUtils
        return DBUtils(spark)
    except ModuleNotFoundError:
        # Otherwise fall back to the dbutils object that Databricks
        # notebooks expose in the IPython user namespace.
        import IPython
        return IPython.get_ipython().user_ns["dbutils"]
And update the code that uses dbutils to use this utility:
from dbk_utils import get_dbutils

my_dbutils = get_dbutils(spark)

my_dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")
my_dbutils.fs.ls(".")
jdbcurl = my_dbutils.widgets.getArgument("my_jdbcurl")

df_node = spark.read.format("jdbc")\
    .option("driver", "org.mariadb.jdbc.Driver")\
    .option("url", jdbcurl)\
    .option("dbtable", "my_table")\
    .option("user", "my_username")\
    .option("password", "my_pswd")\
    .load()
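Note that spark itself will still be reported as undefined when flake8 runs outside Databricks. A minimal sketch of the same idea for the session object (assuming pyspark is installed wherever the code actually executes) is to bind the name explicitly instead of relying on the implicit global:

from pyspark.sql import SparkSession

from dbk_utils import get_dbutils

# getOrCreate() returns the already-active session on Databricks and
# creates a local one elsewhere, so flake8 sees a real assignment.
spark = SparkSession.builder.getOrCreate()
my_dbutils = get_dbutils(spark)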
If you're trying to do unit testing as well, then check out:
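As a rough illustration of that direction (a sketch under assumptions, not taken from those resources): pass spark and dbutils into your own functions instead of reading them as globals, so tests can supply fakes. The function name read_node below is made up for this example.

from unittest.mock import MagicMock


def read_node(spark, dbutils):
    # Same logic as the notebook above, but with dependencies injected.
    dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")
    jdbcurl = dbutils.widgets.getArgument("my_jdbcurl")
    return (
        spark.read.format("jdbc")
        .option("url", jdbcurl)
        .option("dbtable", "my_table")
        .load()
    )


def test_read_node_reads_jdbc():
    # Fakes stand in for the Databricks-provided objects.
    spark = MagicMock()
    dbutils = MagicMock()
    dbutils.widgets.getArgument.return_value = "jdbc:mariadb://example"
    read_node(spark, dbutils)
    spark.read.format.assert_called_once_with("jdbc")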