I'm developing some code that runs on Databricks. Since Databricks can't be run locally, I need to run my unit tests on a Databricks cluster. The problem is that when I install the wheel that contains my files, the test files are never installed. How do I install the test files?
Ideally I would like to keep src and tests in separate folders. Here is my project's folder structure (pyproject.toml only):
project
├── src
│   └── mylib
│       ├── functions.py
│       └── __init__.py
├── pyproject.toml
├── poetry.lock
└── tests
    ├── conftest.py
    └── test_functions.py
My pyproject.toml:
[tool.poetry]
name = "mylib"
version = "0.1.0"
packages = [
    {include = "mylib", from = "src"},
    {include = "tests"}
]

[tool.poetry.dependencies]
python = "^3.8"
pytest = "^7.1.2"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Without {include = "tests"} in pyproject.toml, poetry build doesn't include the tests. After poetry build I can see that the tests are included in the wheel produced (python3 -m wheel unpack <mywheel.whl>). But after I deploy it as a library on a Databricks cluster, I do not see any tests folder (ls -r .../site-packages/mylib* in a Databricks notebook shell cell), although functions.py is installed.
I also tried moving tests under src and updating the toml to {include = "tests", from = "src"}. The wheel file produced then contains both mylib and tests with the appropriate files, but still only mylib gets installed on Databricks.
project
├── src
│   ├── mylib
│   │   ├── functions.py
│   │   └── __init__.py
│   └── tests
│       ├── conftest.py
│       └── test_functions.py
├── pyproject.toml
└── poetry.lock
Since someone suggested dbx as the solution, I tried it. It doesn't work for this. It has a bunch of basic restrictions (e.g. it requires an ML runtime) that render it useless here, not to mention it expects you to use whatever toolset it recommends. Perhaps in a few years it will do what this post needs.
If anyone else is struggling with this, here is what we finally ended up doing.
TL;DR:
1. Write a unit_test_runner.py that can install a wheel file and execute the tests inside of it. The key is to install it at "notebook scope".
2. Upload unit_test_runner.py to Databricks DBFS and create a job pointing to it. The job parameter is the wheel file to pytest.
Project structure:
root
├── dist
│   └── my_project-0.1.0-py3-none-any.whl
├── poetry.lock
├── poetry.toml
├── pyproject.toml
├── module1.py
├── module2.py
├── housekeeping.py
├── common
│   └── aws.py
├── tests
│   ├── conftest.py
│   ├── test_module1.py
│   ├── test_module2.py
│   └── common
│       └── test_aws.py
└── unit_test_runner.py
unit_test_runner.py:
import importlib.util
import logging
import os
import shutil
import sys
from enum import IntEnum

import pip
import pytest


def main(args: list) -> int:
    coverage_opts = []
    if '--cov' == args[0]:
        coverage_opts = ['--cov']
        wheels_to_test = args[1:]
    else:
        wheels_to_test = args
    logging.info(f'coverage_opts: {coverage_opts}, wheels_to_test: {wheels_to_test}')

    for wh_file in wheels_to_test:
        # install the wheel into the current environment
        logging.info('pip install %s', wh_file)
        pip.main(['install', wh_file])

        # we assume a wheel name like <pkg name>-<version>-...
        # e.g. my_module-0.1.0-py3-none-any.whl
        pkg_name = os.path.basename(wh_file).split('-')[0]

        # locate the installed package without importing it,
        # to avoid any issues with coverage data.
        pkg_root = os.path.dirname(importlib.util.find_spec(pkg_name).origin)
        os.chdir(pkg_root)

        pytest_opts = [f'--rootdir={pkg_root}']
        pytest_opts.extend(coverage_opts)
        logging.info(f'pytest_opts: {pytest_opts}')

        rc = pytest.main(pytest_opts)
        logging.info(f'pytest-status: {rc}/{os.waitstatus_to_exitcode(rc)}, wheel: {wh_file}')
        generate_coverage_data(pkg_name, pkg_root, wh_file)

    return rc.value if isinstance(rc, IntEnum) else rc


def generate_coverage_data(pkg_name, pkg_root, wh_file):
    if os.path.exists(f'{pkg_root}/.coverage'):
        shutil.rmtree(f'{pkg_root}/htmlcov', ignore_errors=True)
        output_tar = f'{os.path.dirname(wh_file)}/{pkg_name}-coverage.tar.gz'
        rc = os.system(f'coverage html --data-file={pkg_root}/.coverage && tar -cvzf {output_tar} htmlcov')
        logging.info('rc: %s, coverage data available at: %s', rc, output_tar)


if __name__ == "__main__":
    # silence annoying py4j logging
    logging.getLogger("py4j").setLevel(logging.ERROR)
    logging.info('sys.argv[1:]: %s', sys.argv[1:])
    rc = main(sys.argv[1:])
    if rc != 0:
        raise Exception(f'Unit test execution failed. rc: {rc}, sys.argv[1:]: {sys.argv[1:]}')
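As a sanity check, the runner can also be exercised directly against a built wheel before wiring it into a job; a minimal invocation (the wheel path is just an example) looks like this:

# install the wheel into the current environment and run the tests it ships
python unit_test_runner.py dist/my_project-0.1.0-py3-none-any.whl
# or with coverage
python unit_test_runner.py --cov dist/my_project-0.1.0-py3-none-any.whl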
Install and configure databricks-cli. See instructions here. Then copy unit_test_runner.py to DBFS:

WORKSPACE_ROOT='/home/kash/workspaces'
USER_NAME='kash@company.com'
cd $WORKSPACE_ROOT/my_project
echo 'copying runner..' && \
  databricks fs cp --overwrite unit_test_runner.py dbfs:/user/$USER_NAME/
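To verify the upload landed where expected (assuming the legacy databricks-cli), list the target directory:

databricks fs ls dbfs:/user/$USER_NAME/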
Create a job pointing to dbfs:/user/$USER_NAME/unit_test_runner.py. This can also be done using the CLI, as sketched below. Use databricks jobs list to find the job id, e.g. 123456789.
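For the CLI route, something like the following should work (a sketch, assuming the legacy databricks-cli and Jobs API 2.0; the job name, spark_version and node_type_id are placeholders to adapt):

# minimal job spec; spark_python_task points at the runner on DBFS
cat > unit-test-job.json <<EOF
{
  "name": "unit-test-runner",
  "new_cluster": {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 1
  },
  "spark_python_task": {
    "python_file": "dbfs:/user/$USER_NAME/unit_test_runner.py"
  }
}
EOF
databricks jobs create --json-file unit-test-job.json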
Build the wheel, copy it to DBFS and launch the job:

cd $WORKSPACE_ROOT/my_project
poetry build -f wheel # could be replaced with any builder that creates a wheel file
whl_file=$(ls -1tr dist/my_project*-py3-none-any.whl | tail -1 | xargs basename)
echo 'copying wheel...' && databricks fs cp --overwrite dist/$whl_file dbfs:/user/$USER_NAME/wheels
echo 'launching job..' && \
databricks jobs run-now --job-id 123456789 --python-params "[\"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"
# OR with coverage
echo 'launching job with coverage..' && \
databricks jobs run-now --job-id 123456789 --python-params "[\"--cov\", \"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"
If you ran with the --cov option, fetch and open the coverage report:
rm -rf htmlcov/ my_project-coverage.tar.gz
databricks fs cp dbfs:/user/$USER_NAME/wheels/my_project-coverage.tar.gz .
tar -xvzf my_project-coverage.tar.gz
firefox htmlcov/index.html