Tags: python, unit-testing, databricks, python-wheel, python-poetry

How do I include and install test files in a wheel and deploy to Databricks?


I'm developing code that runs on Databricks. Since Databricks can't be run locally, I need to run my unit tests on a Databricks cluster. The problem is that when I install the wheel containing my files, the test files are never installed. How do I install the test files?

Ideally I would like to keep src and tests in separate folders.


Here is my project's folder structure (it uses pyproject.toml only, no setup.py):

project
├── src
│   └── mylib
│       ├── functions.py
│       └── __init__.py
├── pyproject.toml
├── poetry.lock
└── tests
    ├── conftest.py
    └── test_functions.py

My pyproject.toml:

[tool.poetry]
name = "mylib"
version = "0.1.0"
packages = [
    {include = "mylib", from = "src"},
    {include = "tests"}
]

[tool.poetry.dependencies]
python = "^3.8"
pytest = "^7.1.2"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Without {include = "tests"} in pyproject.toml, poetry build doesn't include tests.

After poetry build I can see that the tests are included in the wheel produced (python3 -m wheel unpack <mywheel.whl>). But after I deploy it as a library on a Databricks cluster, I do not see any tests folder (ls -r .../site-packages/mylib* in a Databricks notebook shell cell), though functions.py is installed.
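
A wheel is just a zip archive, so its contents can also be listed without unpacking it. A minimal sketch (the wheel filename is illustrative):

# wheels are plain zip archives; print every file this one ships
import zipfile

with zipfile.ZipFile('dist/mylib-0.1.0-py3-none-any.whl') as whl:
    for name in whl.namelist():
        print(name)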

I also tried moving tests under src and updating the toml to {include = "tests", from = "src"}. The wheel produced then contains both mylib and tests with the appropriate files, but only mylib gets installed on Databricks.

project
├── src
│   ├── mylib
│   │   ├── functions.py
│   │   └── __init__.py
│   └── tests
│       ├── conftest.py
│       └── test_functions.py
├── pyproject.toml
└── poetry.lock

Since someone has suggested dbx as the solution: I've tried it, and it doesn't work for this. It has a number of basic restrictions (e.g. it requires an ML runtime) that render it useless here, and it expects you to adopt whatever toolset it recommends. Perhaps in a few years it will do what this post needs.


Solution

  • If anyone else is struggling with this, here is what we ended up doing.

    TL;DR:

    • Create a unit_test_runner.py that can install a wheel file and execute the tests inside it. The key is to install the wheel at "notebook scope".
    • Deploy/copy unit_test_runner.py to Databricks DBFS and create a job pointing to it. The job parameter is the wheel file to run pytest against.
    • Build a wheel of your code, copy it to DBFS, and run the unit-test-runner job with the location of the wheel file as its parameter.

    Project structure:

    root
    ├── dist
    │   └── my_project-0.1.0-py3-none-any.whl
    ├── poetry.lock
    ├── poetry.toml
    ├── pyproject.toml
    ├── module1.py
    ├── module2.py
    ├── housekeeping.py
    ├── common
    │   └── aws.py
    ├── tests
    │   ├── conftest.py
    │   ├── test_module1.py
    │   ├── test_module2.py
    │   └── common
    │       └── test_aws.py
    └── unit_test_runner.py
    

    unit_test_runner.py

    import importlib.util
    import logging
    import os
    import shutil
    import sys
    from enum import IntEnum
    
    import pip
    import pytest
    
    
    def main(args: list) -> int:
        coverage_opts = []
        if args and args[0] == '--cov':
            coverage_opts = ['--cov']
            wheels_to_test = args[1:]
        else:
            wheels_to_test = args

        logging.info(f'coverage_opts: {coverage_opts}, wheels_to_test: {wheels_to_test}')

        overall_rc = 0
        for wh_file in wheels_to_test:
            logging.info('pip install %s', wh_file)
            pip.main(['install', wh_file])
            # we assume a wheel name like <pkg name>-<version>-...,
            # e.g. my_project-0.1.0-py3-none-any.whl
            pkg_name = os.path.basename(wh_file).split('-')[0]
            # locate the installed package without importing it,
            # to avoid polluting the coverage data
            pkg_root = os.path.dirname(importlib.util.find_spec(pkg_name).origin)
            os.chdir(pkg_root)

            pytest_opts = [f'--rootdir={pkg_root}']
            pytest_opts.extend(coverage_opts)

            logging.info(f'pytest_opts: {pytest_opts}')
            rc = pytest.main(pytest_opts)
            logging.info(f'pytest-status: {int(rc)}, wheel: {wh_file}')
            generate_coverage_data(pkg_name, pkg_root, wh_file)

            # remember the first failure but keep testing any remaining wheels
            overall_rc = overall_rc or (rc.value if isinstance(rc, IntEnum) else rc)

        return overall_rc
    
    
    def generate_coverage_data(pkg_name, pkg_root, wh_file):
        if os.path.exists(f'{pkg_root}/.coverage'):
            shutil.rmtree(f'{pkg_root}/htmlcov', ignore_errors=True)
            # the report lands next to the wheel, e.g. .../wheels/my_project-coverage.tar.gz
            output_tar = f'{os.path.dirname(wh_file)}/{pkg_name}-coverage.tar.gz'
            rc = os.system(f'coverage html --data-file={pkg_root}/.coverage && tar -cvzf {output_tar} htmlcov')
            logging.info('rc: %s, coverage data available at: %s', rc, output_tar)
    
    
    if __name__ == "__main__":
        # emit our INFO logs to the driver output, but silence noisy py4j logging
        logging.basicConfig(level=logging.INFO)
        logging.getLogger("py4j").setLevel(logging.ERROR)
        logging.info('sys.argv[1:]: %s', sys.argv[1:])
        rc = main(sys.argv[1:])
        if rc != 0:
            # a raised exception marks the Databricks job run as failed
            raise Exception(f'Unit test execution failed. rc: {rc}, sys.argv[1:]: {sys.argv[1:]}')
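
    Before deploying, the runner can be smoke-tested locally against a freshly built wheel. A minimal sketch, assuming pip and pytest are available in the local environment; the wheel path is illustrative:

    # hypothetical local check: installs the wheel into the current
    # environment and runs the tests packaged inside it
    from unit_test_runner import main

    rc = main(['dist/my_project-0.1.0-py3-none-any.whl'])
    assert rc == 0, f'tests failed with rc={rc}'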
    
    
    Copy the runner to DBFS (the workspace path and user below are from our setup; adjust for yours):

    WORKSPACE_ROOT='/home/kash/workspaces'
    USER_NAME='[email protected]'
    cd $WORKSPACE_ROOT/my_project
    echo 'copying runner..' && \
      databricks fs cp --overwrite unit_test_runner.py dbfs:/user/$USER_NAME/
    
    • Go to the Databricks GUI and create a job pointing to dbfs:/user/$USER_NAME/unit_test_runner.py. This can also be done using the CLI or the REST API; see the sketch after this list.
      • Type of job: Python Script
      • Source: DBFS/S3
      • Path: dbfs:/user/$USER_NAME/unit_test_runner.py
    • Run databricks jobs list to find the job id, e.g. 123456789
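
    For reference, a minimal sketch of creating the same job through the Jobs 2.1 REST API; the host, token, and cluster settings are illustrative assumptions, not values from this post:

    # hypothetical job creation via POST /api/2.1/jobs/create
    import requests

    HOST = 'https://<your-workspace>.cloud.databricks.com'  # assumption
    TOKEN = '<personal-access-token>'                        # assumption

    job_spec = {
        'name': 'unit-test-runner',
        'tasks': [{
            'task_key': 'run-unit-tests',
            'spark_python_task': {'python_file': 'dbfs:/user/<user-name>/unit_test_runner.py'},
            'new_cluster': {  # cluster settings are assumptions
                'spark_version': '11.3.x-scala2.12',
                'node_type_id': 'i3.xlarge',
                'num_workers': 1,
            },
        }],
    }
    resp = requests.post(f'{HOST}/api/2.1/jobs/create',
                         headers={'Authorization': f'Bearer {TOKEN}'},
                         json=job_spec)
    resp.raise_for_status()
    print(resp.json())  # contains the new job_id

    Then build the wheel and launch the job:
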
    cd $WORKSPACE_ROOT/my_project
    poetry build -f wheel # could be replaced with any builder that creates a wheel file
    whl_file=$(ls -1tr dist/my_project*-py3-none-any.whl | tail -1 | xargs basename)
    echo 'copying wheel...' && databricks fs cp --overwrite dist/$whl_file dbfs:/user/$USER_NAME/wheels
    echo 'launching job...' && \
      databricks jobs run-now --job-id 123456789 --python-params "[\"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"
    # OR with coverage
    echo 'launching job with coverage...' && \
      databricks jobs run-now --job-id 123456789 --python-params "[\"--cov\", \"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"
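
    The runner raises on any test failure, so the job run itself fails. To check the outcome programmatically, a hedged sketch polling GET /api/2.1/jobs/runs/get (run_id comes from the run-now output; HOST and TOKEN as in the earlier sketch):

    # hypothetical status check for a job run
    import time
    import requests

    HOST = 'https://<your-workspace>.cloud.databricks.com'  # assumption
    TOKEN = '<personal-access-token>'                        # assumption

    def wait_for_run(run_id: int) -> str:
        while True:
            resp = requests.get(f'{HOST}/api/2.1/jobs/runs/get',
                                headers={'Authorization': f'Bearer {TOKEN}'},
                                params={'run_id': run_id})
            resp.raise_for_status()
            state = resp.json()['state']
            if state['life_cycle_state'] in ('TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'):
                # 'SUCCESS' means every test passed
                return state.get('result_state', 'UNKNOWN')
            time.sleep(30)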
    

    If you ran with the --cov option, fetch and open the coverage report:

    rm -rf htmlcov/ my_project-coverage.tar.gz
    databricks fs cp dbfs:/user/$USER_NAME/wheels/my_project-coverage.tar.gz .
    tar -xvzf my_project-coverage.tar.gz
    firefox htmlcov/index.html