docker · cluster-computing · databricks · azure-databricks · workspace

Missing Workspace Directory in Customized Databricks Cluster


I have recently started working with Azure Databricks for some machine learning pipelines. For that I need to be able to create and use custom Docker images for the clusters, so that I can install all my dependencies.

I tried to follow the official documentation for custom containers and looked at the sample Dockerfiles in the official git repo. So far I have been able to follow the examples and build an image based on the miniconda cluster example they provide.

When I create a cluster using this customized Docker image and start it on Databricks, everything seems fine: my dependencies are installed and I can use the cluster normally if I create a notebook in my workspace and attach it to the cluster. However, if I try to do the same from a notebook that lives in my Repos, I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/Workspace/Repos/[my-id]/[my-repo-name]'

And indeed, when I check the directories available on the cluster I do not see any /Workspace. It is worth mentioning that if I create a normal cluster from the UI without any custom Docker image, there is no such issue: the workspace is available on the cluster and can be used from notebooks within the repo.
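
For reference, a quick check along these lines from a notebook cell (the %sh magic runs shell commands on the driver node) is enough to see the difference between the two cluster types:

    %sh
    # Inspect the driver's root filesystem: on the custom-image cluster there is
    # no /Workspace entry, while a cluster built from a standard runtime has one.
    ls -la /
    ls -d /Workspace 2>/dev/null || echo "/Workspace is not mounted"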

I am not sure what I am doing wrong, or whether there is a step that I have missed. I also do not know what exactly is different between using a custom image for the cluster and using the provided runtime images, which do make the workspace available on the cluster. Has anyone run into this and found an answer?

The Dockerfile I use to build the image for the cluster is this:

    FROM ubuntu:18.04 AS builder

    RUN apt-get update && apt-get install --yes \
        wget \
        libdigest-sha-perl \
        bzip2

    RUN wget -q https://repo.continuum.io/miniconda/Miniconda3-py38_4.9.2-Linux-x86_64.sh -O miniconda.sh \
        # Conda must be installed at /databricks/conda
        && /bin/bash miniconda.sh -b -p /databricks/conda \
        && rm miniconda.sh

    FROM databricksruntime/minimal:9.x

    COPY --from=builder /databricks/conda /databricks/conda

    COPY environment.yml /databricks/.conda-env-def/env.yml

    RUN /databricks/conda/bin/conda env create --file /databricks/.conda-env-def/env.yml \
        # Source conda.sh for all login shells.
        && ln -s /databricks/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh

    RUN /databricks/conda/bin/conda config --system --set channel_priority strict \
        && /databricks/conda/bin/conda config --system --set always_yes True

    ENV DEFAULT_DATABRICKS_ROOT_CONDA_ENV=[my_conda_env_name]
    ENV DATABRICKS_ROOT_CONDA_ENV=[my_conda_env_name]
    ENV PYSPARK_PYTHON=/databricks/conda/bin/conda

    ENV USER root

Solution

  • The /Workspace path is a special kind of mount point that maps your workspace objects stored in the control plane (the Databricks-managed environment) onto real files on the machines running in your environment (the data plane). To have this mount point you need a special script that is shipped by default inside the Databricks runtimes, but it is missing in your setup.

    I would recommend opening a ticket with Microsoft support to get hold of this script, which you will then need to install inside your Docker container (Azure Databricks is a Microsoft product, so all support cases need to go through them). Another possibility is to contact your admin; maybe they have a direct contact with Databricks representatives.

    But the main question is: do you really need to use a custom Docker container? In most cases it is enough to set up the libraries with cluster init scripts, especially if you store all the necessary libraries as pre-built binary packages (so nothing has to be compiled) and put them on DBFS, from where they can be installed directly without any network transfer (see the sketch below).
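
    As an illustration only (not an official template), a minimal init script along these lines could install pre-built wheels that were staged at a hypothetical DBFS path such as dbfs:/FileStore/libs; the pip path is the one used in Databricks init-script examples:

        #!/bin/bash
        # Sketch of a cluster init script: install pre-built wheels staged on DBFS.
        # The DBFS folder and the package name below are placeholders.
        set -euo pipefail

        # DBFS is exposed on every node under /dbfs, so the wheels are read locally.
        /databricks/python/bin/pip install --no-index \
            --find-links=/dbfs/FileStore/libs \
            my-package

    The script itself is then attached via the cluster's init scripts setting, so it runs on every node at startup.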