Tags: python, apache-spark, github, databricks, flake8

flake8 linting for Databricks Python code in GitHub using workflows


I have my Databricks Python code in GitHub, and I set up a basic workflow to lint it with flake8. The lint step fails because the names that are implicitly available to my script when it runs on Databricks (spark, sc, dbutils, getArgument, etc.) are not defined when flake8 analyses it outside Databricks (on the GitHub-hosted Ubuntu VM).

How can I lint Databricks notebooks in GitHub using flake8?

E.g. errors I get:

test.py:1:1: F821 undefined name 'dbutils'
test.py:3:11: F821 undefined name 'getArgument'
test.py:5:1: F821 undefined name 'dbutils'
test.py:7:11: F821 undefined name 'spark'

My notebook in GitHub:

dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")

jdbcurl = getArgument("my_jdbcurl")

dbutils.fs.ls(".")

df_node = spark.read.format("jdbc")\
  .option("driver", "org.mariadb.jdbc.Driver")\
  .option("url", jdbcurl)\
  .option("dbtable", "my_table")\
  .option("user", "my_username")\
  .option("password", "my_pswd")\
  .load()

My .github/workflows/lint.yml:

on:
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v1
      with:
        python-version: 3.8
    - run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Lint with flake8
      run: |
        pip install flake8
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

Solution

  • TL;DR

    Don't use the built-in variable dbutils in code that needs to run both locally (IDE, unit tests, ...) and on Databricks (production). Create your own instance of the DBUtils class instead.


    Here is what we ended up doing:

    Created a new dbk_utils.py

    from pyspark.sql import SparkSession

    def get_dbutils(spark: SparkSession):
        """Return a dbutils handle without relying on the implicit notebook global."""
        try:
            # pyspark.dbutils is available on Databricks clusters and with
            # databricks-connect, so build DBUtils from the SparkSession.
            from pyspark.dbutils import DBUtils
            return DBUtils(spark)
        except ModuleNotFoundError:
            # In notebooks where that import is unavailable, fall back to the
            # dbutils object Databricks places in the IPython user namespace.
            import IPython
            return IPython.get_ipython().user_ns["dbutils"]

    Then update the code that uses dbutils to go through this utility, and create the SparkSession explicitly so that spark is defined as well:

    from pyspark.sql import SparkSession

    from dbk_utils import get_dbutils

    # getOrCreate() returns the session Databricks already provides when the
    # code runs on a cluster, and builds one when the code runs elsewhere.
    spark = SparkSession.builder.getOrCreate()
    my_dbutils = get_dbutils(spark)

    my_dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")
    my_dbutils.fs.ls(".")

    jdbcurl = my_dbutils.widgets.getArgument("my_jdbcurl")

    df_node = spark.read.format("jdbc")\
      .option("driver", "org.mariadb.jdbc.Driver")\
      .option("url", jdbcurl)\
      .option("dbtable", "my_table")\
      .option("user", "my_username")\
      .option("password", "my_pswd")\
      .load()
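
    As a complementary stop-gap (not a substitute for the refactor above), flake8 can also be told to treat any names you deliberately keep implicit, such as sc, as known builtins so that F821 is not raised for them. A sketch of the lint command with the extra flag, assuming the same step as in the workflow above:

    flake8 . --count --select=E9,F63,F7,F82 --builtins=spark,sc,dbutils,getArgument --show-source --statistics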
    

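    Splitting the logic out this way also makes it straightforward to unit test off-cluster, because the dbutils handle is just an object that can be replaced with a mock. A minimal, hypothetical pytest sketch; the load_jdbcurl helper and the fake JDBC URL are illustrative assumptions, not part of the original code:

    # test_notebook_logic.py -- hypothetical sketch
    from unittest.mock import MagicMock

    def load_jdbcurl(dbutils):
        # Notebook logic refactored into a function that receives dbutils explicitly.
        dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")
        return dbutils.widgets.getArgument("my_jdbcurl")

    def test_load_jdbcurl_reads_widget():
        fake_dbutils = MagicMock()
        fake_dbutils.widgets.getArgument.return_value = "jdbc:mariadb://host/db"

        assert load_jdbcurl(fake_dbutils) == "jdbc:mariadb://host/db"
        fake_dbutils.widgets.getArgument.assert_called_once_with("my_jdbcurl")
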
    If you're trying to do unit testing as well, then check out: