Tags: python, unit-testing, databricks, python-wheel, python-poetry

How do I include and install test files in a wheel and deploy to Databricks?


I'm developing code that runs on Databricks. Since Databricks can't be run locally, I need to run my unit tests on a Databricks cluster. The problem is that when I install the wheel containing my files, the test files are never installed. How do I install the test files?

Ideally I would like to keep src and tests in separate folders.


Here is my project's folder structure (it uses pyproject.toml only, no setup.py):

project
├── src
│   └── mylib
│       ├── functions.py
│       └── __init__.py
├── pyproject.toml
├── poetry.lock
└── tests
    ├── conftest.py
    └── test_functions.py

My pyproject.toml:

[tool.poetry]
name = "mylib"
version = "0.1.0"
packages = [
    {include = "mylib", from = "src"},
    {include = "tests"}
]

[tool.poetry.dependencies]
python = "^3.8"
pytest = "^7.1.2"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Without {include = "tests"} in pyproject.toml, poetry build doesn't include tests.

After poetry build I can see that the tests are included in the wheel produced (python3 -m wheel unpack <mywheel.whl>). But after I deploy it as a library on a Databricks cluster, I do not see any tests folder (ls -r .../site-packages/mylib* in a Databricks notebook shell cell), though functions.py is installed.
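
A wheel is just a zip archive, so its contents can also be listed without unpacking it. A minimal sketch (the wheel filename is illustrative):

# wheels are plain zip archives; print every file this one ships
import zipfile

with zipfile.ZipFile('dist/mylib-0.1.0-py3-none-any.whl') as whl:
    for name in whl.namelist():
        print(name)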

I also tried moving tests under src and updating the toml to {include = "tests", from = "src"}. The wheel produced then contains both mylib and tests with the appropriate files, but only mylib gets installed on Databricks.

project
├── src
│   ├── mylib
│   │   ├── functions.py
│   │   └── __init__.py
│   └── tests
│       ├── conftest.py
│       └── test_functions.py
├── pyproject.toml
└── poetry.lock

Since someone has suggested dbx as the solution: I've tried it, and it doesn't work for this. It has a number of basic restrictions (e.g. it requires an ML runtime) that render it useless here, and it expects you to adopt whatever toolset it recommends. Perhaps in a few years it will do what this post needs.


Solution

  • If anyone else is struggling with this, here is what we ended up doing.

    TL;DR:

    • Create a unit_test_runner.py that can install a wheel file and execute the tests inside it. The key is to install the wheel at "notebook scope".
    • Deploy/copy unit_test_runner.py to Databricks DBFS and create a job pointing to it. The job parameter is the wheel file to run pytest against.
    • Build a wheel of your code, copy it to DBFS, and run the unit-test-runner job with the location of the wheel file as its parameter.

    Project structure:

    root
    ├── dist
    │   └── my_project-0.1.0-py3-none-any.whl
    ├── poetry.lock
    ├── poetry.toml
    ├── pyproject.toml
    ├── module1.py
    ├── module2.py
    ├── housekeeping.py
    ├── common
    │   └── aws.py
    ├── tests
    │   ├── conftest.py
    │   ├── test_module1.py
    │   ├── test_module2.py
    │   └── common
    │       └── test_aws.py
    └── unit_test_runner.py
    

    unit_test_runner.py

    import importlib.util
    import logging
    import os
    import shutil
    import sys
    from enum import IntEnum
    
    import pip
    import pytest
    
    
    def main(args: list) -> int:
        coverage_opts = []
        if args and args[0] == '--cov':
            coverage_opts = ['--cov']
            wheels_to_test = args[1:]
        else:
            wheels_to_test = args

        logging.info(f'coverage_opts: {coverage_opts}, wheels_to_test: {wheels_to_test}')

        overall_rc = 0
        for wh_file in wheels_to_test:
            logging.info('pip install %s', wh_file)
            pip.main(['install', wh_file])
            # we assume a wheel name like <pkg name>-<version>-...,
            # e.g. my_project-0.1.0-py3-none-any.whl
            pkg_name = os.path.basename(wh_file).split('-')[0]
            # locate the installed package without importing it,
            # to avoid polluting the coverage data
            pkg_root = os.path.dirname(importlib.util.find_spec(pkg_name).origin)
            os.chdir(pkg_root)

            pytest_opts = [f'--rootdir={pkg_root}']
            pytest_opts.extend(coverage_opts)

            logging.info(f'pytest_opts: {pytest_opts}')
            rc = pytest.main(pytest_opts)
            logging.info(f'pytest-status: {int(rc)}, wheel: {wh_file}')
            generate_coverage_data(pkg_name, pkg_root, wh_file)

            # remember the first failure but keep testing any remaining wheels
            overall_rc = overall_rc or (rc.value if isinstance(rc, IntEnum) else rc)

        return overall_rc
    
    
    def generate_coverage_data(pkg_name, pkg_root, wh_file):
        if os.path.exists(f'{pkg_root}/.coverage'):
            shutil.rmtree(f'{pkg_root}/htmlcov', ignore_errors=True)
            # the report lands next to the wheel, e.g. .../wheels/my_project-coverage.tar.gz
            output_tar = f'{os.path.dirname(wh_file)}/{pkg_name}-coverage.tar.gz'
            rc = os.system(f'coverage html --data-file={pkg_root}/.coverage && tar -cvzf {output_tar} htmlcov')
            logging.info('rc: %s, coverage data available at: %s', rc, output_tar)
    
    
    if __name__ == "__main__":
        # emit our INFO logs to the driver output, but silence noisy py4j logging
        logging.basicConfig(level=logging.INFO)
        logging.getLogger("py4j").setLevel(logging.ERROR)
        logging.info('sys.argv[1:]: %s', sys.argv[1:])
        rc = main(sys.argv[1:])
        if rc != 0:
            # a raised exception marks the Databricks job run as failed
            raise Exception(f'Unit test execution failed. rc: {rc}, sys.argv[1:]: {sys.argv[1:]}')
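
    Before deploying, the runner can be smoke-tested locally against a freshly built wheel. A minimal sketch, assuming pip and pytest are available in the local environment; the wheel path is illustrative:

    # hypothetical local check: installs the wheel into the current
    # environment and runs the tests packaged inside it
    from unit_test_runner import main

    rc = main(['dist/my_project-0.1.0-py3-none-any.whl'])
    assert rc == 0, f'tests failed with rc={rc}'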
    
    
    Copy the runner to DBFS (the workspace path and user below are from our setup; adjust for yours):

    WORKSPACE_ROOT='/home/kash/workspaces'
    USER_NAME='[email protected]'
    cd $WORKSPACE_ROOT/my_project
    echo 'copying runner..' && \
      databricks fs cp --overwrite unit_test_runner.py dbfs:/user/$USER_NAME/
    
    • Go to the Databricks GUI and create a job pointing to dbfs:/user/$USER_NAME/unit_test_runner.py. This can also be done using the CLI or the REST API; see the sketch after this list.
      • Type of job: Python Script
      • Source: DBFS/S3
      • Path: dbfs:/user/$USER_NAME/unit_test_runner.py
    • Run databricks jobs list to find the job id, e.g. 123456789
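
    For reference, a minimal sketch of creating the same job through the Jobs 2.1 REST API; the host, token, and cluster settings are illustrative assumptions, not values from this post:

    # hypothetical job creation via POST /api/2.1/jobs/create
    import requests

    HOST = 'https://<your-workspace>.cloud.databricks.com'  # assumption
    TOKEN = '<personal-access-token>'                        # assumption

    job_spec = {
        'name': 'unit-test-runner',
        'tasks': [{
            'task_key': 'run-unit-tests',
            'spark_python_task': {'python_file': 'dbfs:/user/<user-name>/unit_test_runner.py'},
            'new_cluster': {  # cluster settings are assumptions
                'spark_version': '11.3.x-scala2.12',
                'node_type_id': 'i3.xlarge',
                'num_workers': 1,
            },
        }],
    }
    resp = requests.post(f'{HOST}/api/2.1/jobs/create',
                         headers={'Authorization': f'Bearer {TOKEN}'},
                         json=job_spec)
    resp.raise_for_status()
    print(resp.json())  # contains the new job_id

    Then build the wheel and launch the job:
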
    cd $WORKSPACE_ROOT/my_project
    poetry build -f wheel # could be replaced with any builder that creates a wheel file
    whl_file=$(ls -1tr dist/my_project*-py3-none-any.whl | tail -1 | xargs basename)
    echo 'copying wheel...' && databricks fs cp --overwrite dist/$whl_file dbfs:/user/$USER_NAME/wheels
    echo 'launching job...' && \
      databricks jobs run-now --job-id 123456789 --python-params "[\"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"
    # OR with coverage
    echo 'launching job with coverage...' && \
      databricks jobs run-now --job-id 123456789 --python-params "[\"--cov\", \"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"
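
    The runner raises on any test failure, so the job run itself fails. To check the outcome programmatically, a hedged sketch polling GET /api/2.1/jobs/runs/get (run_id comes from the run-now output; HOST and TOKEN as in the earlier sketch):

    # hypothetical status check for a job run
    import time
    import requests

    HOST = 'https://<your-workspace>.cloud.databricks.com'  # assumption
    TOKEN = '<personal-access-token>'                        # assumption

    def wait_for_run(run_id: int) -> str:
        while True:
            resp = requests.get(f'{HOST}/api/2.1/jobs/runs/get',
                                headers={'Authorization': f'Bearer {TOKEN}'},
                                params={'run_id': run_id})
            resp.raise_for_status()
            state = resp.json()['state']
            if state['life_cycle_state'] in ('TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'):
                # 'SUCCESS' means every test passed
                return state.get('result_state', 'UNKNOWN')
            time.sleep(30)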
    

    If you ran with the --cov option, fetch and open the coverage report:

    rm -rf htmlcov/ my_project-coverage.tar.gz
    databricks fs cp dbfs:/user/$USER_NAME/wheels/my_project-coverage.tar.gz .
    tar -xvzf my_project-coverage.tar.gz
    firefox htmlcov/index.html