
Create local conda environment with the databricks runtime ML 8.2


I am trying to replicate the Azure Databricks Runtime ML 8.2 on my local computer so that I don't need to start a cluster in Azure Databricks for testing, yet still have the same environment (dependencies). I started by exporting the dependencies from a Databricks notebook by running %conda env export -f /dbfs/path/to/environment_8.2_ML.yml.

Then, on my PC (macOS), I tried running conda env create --file=environment_8.2_ML.yml, but it cannot find some libraries:

ResolvePackageNotFound:
  - libgfortran-ng=7.3.0
  - ld_impl_linux-64=2.33.1
  - libstdcxx-ng=9.1.0
  - libgcc-ng=9.1.0

I also needed to remove the build string from each of the conda libraries.
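
For illustration, removing the build strings could be scripted roughly as follows. This is a minimal sketch, not the exact procedure I used: the function name `strip_build_strings` is hypothetical, and it assumes every conda dependency line has the form `- name=version=build`, while pip requirements (which use `==`) are left untouched.

```python
import re

# Matches a conda dependency line such as "  - numpy=1.19.2=py38h54aff64_0",
# optionally followed by a trailing comment. Pip lines ("name==version")
# do not match because the version part may not start with "=".
_CONDA_SPEC = re.compile(r"^(\s*-\s+[^=\s]+=[^=\s]+)=[^=\s]+(.*)$")


def strip_build_strings(yaml_text):
    """Return the YAML text with the build string removed from conda specs."""
    out = []
    for line in yaml_text.splitlines():
        match = _CONDA_SPEC.match(line)
        if match:
            # Keep "- name=version" plus anything after the build (e.g. a comment).
            out.append(match.group(1) + match.group(2))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = (
        "dependencies:\n"
        "  - numpy=1.19.2=py38h54aff64_0\n"
        "  - pip:\n"
        "    - xgboost==1.3.3\n"
    )
    print(strip_build_strings(example), end="")
```

To apply it to the exported file, read environment_8.2_ML.yml as text, pass it through the function, and write the result to a new file.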

If any of you have a working YAML file or have successfully replicated the Databricks Runtime ML on a local computer, please help :)

Thank you in advance!


Solution

  • OK, I found a way to achieve this: build a Docker image and run it in a container, so that the conda environment is created inside that container. For some reason, the same steps do not work directly on my macOS machine (installing CMake with brew and then creating the conda environment, without Docker). This is presumably because the YAML file pins Linux-specific packages and builds.

    Steps I have followed to be able to install all dependencies from the Databricks runtime ML 8.2 YAML file:

    1. Get the Databricks Runtime ML 8.2 YAML file from here (as @alexott mentions in his answer)

    2. Remove these conda libraries from the YAML file:

        - libgfortran-ng=7.3.0
        - ld_impl_linux-64=2.33.1
        - libstdcxx-ng=9.1.0
        - libgcc-ng=9.1.0
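
Step 2 can also be scripted instead of edited by hand. The following is a minimal sketch under stated assumptions: the names `PACKAGES_TO_DROP` and `drop_packages` are hypothetical, and the code treats the YAML file as plain text with one `- name=version[=build]` entry per line, which holds for the file shown in this answer.

```python
# The four packages that conda could not resolve in the question.
PACKAGES_TO_DROP = (
    "libgfortran-ng",
    "ld_impl_linux-64",
    "libstdcxx-ng",
    "libgcc-ng",
)


def drop_packages(yaml_text, names=PACKAGES_TO_DROP):
    """Return the YAML text without the dependency lines for the given packages."""
    kept = []
    for line in yaml_text.splitlines():
        spec = line.strip()
        # For "- name=version=build" the package name is the text before the first "=".
        pkg = spec[2:].split("=", 1)[0].strip() if spec.startswith("- ") else None
        if pkg in names:
            continue  # skip this dependency entirely
        kept.append(line)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    example = (
        "dependencies:\n"
        "  - libffi=3.3=he6710b0_2\n"
        "  - libgcc-ng=9.1.0=hdf63c60_0\n"
    )
    print(drop_packages(example), end="")
```

Only exact name matches are removed, so entries such as `_libgcc_mutex` and `libffi` stay in the file.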
    

    So the YAML file will look like this:

    name: databricks-ml-8.2
    channels:
      - pytorch
      - defaults
    dependencies:
      - _libgcc_mutex=0.1=main
      - absl-py=0.11.0=pyhd3eb1b0_1
      - aiohttp=3.7.4=py38h27cfd23_1
      - asn1crypto=1.4.0=py_0
      - astor=0.8.1=py38h06a4308_0
      - async-timeout=3.0.1=py38h06a4308_0
      - attrs=20.3.0=pyhd3eb1b0_0
      - backcall=0.2.0=pyhd3eb1b0_0
      - bcrypt=3.2.0=py38h7b6447c_0
      - blas=1.0=mkl
      - blinker=1.4=py38h06a4308_0
      - boto3=1.16.7=pyhd3eb1b0_0
      - botocore=1.19.7=pyhd3eb1b0_0
      - brotlipy=0.7.0=py38h27cfd23_1003
      - bzip2=1.0.8=h7b6447c_0
      - c-ares=1.17.1=h27cfd23_0
      - ca-certificates=2021.4.13=h06a4308_1 # (updated from 2021.1.19 in May 18, 2021 maintenance update)
      - cachetools=4.2.1=pyhd3eb1b0_0
      - certifi=2020.12.5=py38h06a4308_0
      - cffi=1.14.3=py38h261ae71_2
      - chardet=3.0.4=py38h06a4308_1003
      - click=7.1.2=pyhd3eb1b0_0
      - cloudpickle=1.6.0=py_0
      - configparser=5.0.1=py_0
      - cpuonly=1.0=0
      - cryptography=3.1.1=py38h1ba5d50_0
      - cycler=0.10.0=py38_0
      - cython=0.29.21=py38h2531618_0
      - decorator=4.4.2=pyhd3eb1b0_0
      - dill=0.3.2=py_0
      - docutils=0.15.2=py38h06a4308_1
      - entrypoints=0.3=py38_0
      - ffmpeg=4.2.2=h20bf706_0
      - flask=1.1.2=pyhd3eb1b0_0
      - freetype=2.10.4=h5ab3b9f_0
      - future=0.18.2=py38_1
      - gitdb=4.0.5=py_0
      - gitpython=3.1.12=pyhd3eb1b0_1
      - gmp=6.1.2=h6c8ec71_1
      - gnutls=3.6.5=h71b1129_1002
      - google-auth=1.22.1=py_0
      - google-auth-oauthlib=0.4.2=pyhd3eb1b0_2
      - google-pasta=0.2.0=py_0
      - gunicorn=20.0.4=py38h06a4308_0
      - h5py=2.10.0=py38h7918eee_0
      - hdf5=1.10.4=hb1b8bf9_0
      - icu=58.2=he6710b0_3
      - idna=2.10=pyhd3eb1b0_0
      - importlib-metadata=2.0.0=py_1
      - intel-openmp=2019.4=243
      - ipykernel=5.3.4=py38h5ca1d4c_0
      - ipython=7.19.0=py38hb070fc8_1
      - ipython_genutils=0.2.0=pyhd3eb1b0_1
      - isodate=0.6.0=py_1
      - itsdangerous=1.1.0=pyhd3eb1b0_0
      - jedi=0.17.2=py38h06a4308_1
      - jinja2=2.11.2=pyhd3eb1b0_0
      - jmespath=0.10.0=py_0
      - joblib=0.17.0=py_0
      - jpeg=9b=h024ee3a_2
      - jupyter_client=6.1.7=py_0
      - jupyter_core=4.6.3=py38_0
      - kiwisolver=1.3.0=py38h2531618_0
      - krb5=1.17.1=h173b8e3_0
      - lame=3.100=h7b6447c_0
      - lcms2=2.11=h396b838_0
      - libedit=3.1.20191231=h14c3975_1
      - libffi=3.3=he6710b0_2
      - libopus=1.3.1=h7b6447c_0
      - libpng=1.6.37=hbc83047_0
      - libpq=12.2=h20c2e04_0
      - libprotobuf=3.13.0.1=hd408876_0
      - libsodium=1.0.18=h7b6447c_0
      - libtiff=4.1.0=h2733197_1
      - libuv=1.40.0=h7b6447c_0
      - libvpx=1.7.0=h439df22_0
      - lightgbm=3.1.1=py38h2531618_0
      - lz4-c=1.9.2=heb0550a_3
      - mako=1.1.3=py_0
      - markdown=3.3.3=py38h06a4308_0
      - markupsafe=1.1.1=py38h7b6447c_0
      - matplotlib-base=3.2.2=py38hef1b27d_0
      - mkl=2019.4=243
      - mkl-service=2.3.0=py38he904b0f_0
      - mkl_fft=1.2.0=py38h23d657b_0
      - mkl_random=1.1.0=py38h962f231_0
      - more-itertools=8.6.0=pyhd3eb1b0_0
      - multidict=5.1.0=py38h27cfd23_2
      - ncurses=6.2=he6710b0_1
      - nettle=3.4.1=hbb512f6_0
      - networkx=2.5=py_0
      - ninja=1.10.2=py38hff7bd54_0
      - nltk=3.5=py_0
      - numpy=1.19.2=py38h54aff64_0
      - numpy-base=1.19.2=py38hfa32c7d_0
      - oauthlib=3.1.0=py_0
      - olefile=0.46=py_0
      - openh264=2.1.0=hd408876_0
      - openssl=1.1.1k=h27cfd23_0 # (updated from 1.1.1i in May 18, 2021 maintenance update)
      - packaging=20.4=py_0
      - pandas=1.1.3=py38he6710b0_0
      - paramiko=2.7.2=py_0
      - parso=0.7.0=py_0
      - patsy=0.5.1=py38_0
      - pexpect=4.8.0=pyhd3eb1b0_3
      - pickleshare=0.7.5=pyhd3eb1b0_1003
      - pillow=8.0.1=py38he98fc37_0
      - pip=20.2.4=py38h06a4308_0
      - plotly=4.14.3=pyhd3eb1b0_0
      - prompt-toolkit=3.0.8=py_0
      - prompt_toolkit=3.0.8=0
      - protobuf=3.13.0.1=py38he6710b0_1
      - psutil=5.7.2=py38h7b6447c_0
      - psycopg2=2.8.5=py38h3c74f83_1
      - ptyprocess=0.6.0=pyhd3eb1b0_2
      - pyasn1=0.4.8=py_0
      - pyasn1-modules=0.2.8=py_0
      - pycparser=2.20=py_2
      - pygments=2.7.2=pyhd3eb1b0_0
      - pyjwt=1.7.1=py38_0
      - pynacl=1.4.0=py38h7b6447c_1
      - pyodbc=4.0.30=py38he6710b0_0
      - pyopenssl=19.1.0=pyhd3eb1b0_1
      - pyparsing=2.4.7=pyhd3eb1b0_0
      - pysocks=1.7.1=py38h06a4308_0
      - python=3.8.8=hdb3f193_4 # (updated from 3.8.5 in May 18, 2021 maintenance update)
      - python-dateutil=2.8.1=pyhd3eb1b0_0
      - python-editor=1.0.4=py_0
      - pytorch=1.8.1=py3.8_cpu_0
      - pytz=2020.5=pyhd3eb1b0_0
      - pyzmq=19.0.2=py38he6710b0_1
      - readline=8.0=h7b6447c_0
      - regex=2020.10.15=py38h7b6447c_0
      - requests=2.24.0=py_0
      - requests-oauthlib=1.3.0=py_0
      - retrying=1.3.3=py_2
      - rsa=4.7.2=pyhd3eb1b0_1
      - s3transfer=0.3.6=pyhd3eb1b0_0
      - scikit-learn=0.23.2=py38h0573a6f_0
      - scipy=1.5.2=py38h0b6359f_0
      - setuptools=50.3.1=py38h06a4308_1
      - simplejson=3.17.2=py38h27cfd23_2
      - six=1.15.0=py38h06a4308_0
      - smmap=3.0.5=pyhd3eb1b0_0
      - sqlite=3.33.0=h62c20be_0
      - sqlparse=0.4.1=py_0
      - statsmodels=0.12.0=py38h7b6447c_0
      - tabulate=0.8.7=py38h06a4308_0
      - threadpoolctl=2.1.0=pyh5ca1d4c_0
      - tk=8.6.10=hbc83047_0
      - torchvision=0.9.1=py38_cpu
      - tornado=6.0.4=py38h7b6447c_1
      - tqdm=4.50.2=py_0
      - traitlets=5.0.5=pyhd3eb1b0_0
      - typing-extensions=3.7.4.3=hd3eb1b0_0
      - typing_extensions=3.7.4.3=pyh06a4308_0
      - unixodbc=2.3.9=h7b6447c_0
      - urllib3=1.25.11=py_0
      - wcwidth=0.2.5=py_0
      - websocket-client=0.57.0=py38_2
      - werkzeug=1.0.1=pyhd3eb1b0_0
      - wheel=0.35.1=pyhd3eb1b0_0
      - wrapt=1.12.1=py38h7b6447c_1
      - x264=1!157.20191217=h7b6447c_0
      - xz=5.2.5=h7b6447c_0
      - yarl=1.6.3=py38h27cfd23_0
      - zeromq=4.3.3=he6710b0_3
      - zipp=3.4.0=pyhd3eb1b0_0
      - zlib=1.2.11=h7b6447c_3
      - zstd=1.4.5=h9ceee32_0
      - pip:
        - argon2-cffi==20.1.0
        - astunparse==1.6.3
        - async-generator==1.10
        - azure-core==1.11.0
        - azure-storage-blob==12.7.1
        - bleach==3.3.0
        - confuse==1.4.0
        - databricks-cli==0.14.3
        - defusedxml==0.7.1
        - diskcache==5.2.1
        - docker==4.4.4
        - flatbuffers==1.12
        - gast==0.3.3
        - grpcio==1.32.0
        - horovod==0.21.3
        - htmlmin==0.1.12
        - imagehash==4.2.0
        - ipywidgets==7.6.3
        - joblibspark==0.3.0
        - jsonschema==3.2.0
        - jupyterlab-pygments==0.1.2
        - jupyterlab-widgets==1.0.0
        - keras-preprocessing==1.1.2
        - koalas==1.7.0
        - llvmlite==0.36.0
        - missingno==0.4.2
        - mistune==0.8.4
        - mleap==0.16.1
        - mlflow-skinny==1.15.0
        - msrest==0.6.21
        - nbclient==0.5.3
        - nbconvert==6.0.7
        - nbformat==5.1.2
        - nest-asyncio==1.5.1
        - notebook==6.3.0
        - numba==0.53.1
        - opt-einsum==3.3.0
        - pandas-profiling==2.11.0
        - pandocfilters==1.4.3
        - petastorm==0.9.8
        - phik==0.11.2
        - prometheus-client==0.9.0
        - pyarrow==1.0.1
        - pyrsistent==0.17.3
        - pywavelets==1.1.1
        - pyyaml==5.4.1
        - querystring-parser==1.2.4
        - seaborn==0.10.0
        - send2trash==1.5.0
        - shap==0.39.0
        - slicer==0.0.7
        - spark-tensorflow-distributor==0.1.0
        - tangled-up-in-unicode==0.0.7
        - tensorboard==2.4.1
        - tensorboard-plugin-wit==1.8.0
        - tensorflow-cpu==2.4.1
        - tensorflow-estimator==2.4.0
        - termcolor==1.1.0
        - terminado==0.9.4
        - testpath==0.4.4
        - visions==0.6.0
        - webencodings==0.5.1
        - widgetsnbextension==3.5.1
        - xgboost==1.3.3
    prefix: /databricks/conda/envs/databricks-ml
    
    3. Create your Dockerfile:
    FROM continuumio/miniconda3
    
    # Install build tools and CMake
    RUN apt-get update && apt-get install -y build-essential cmake
    
    COPY <your-yaml-file.yml> /tmp/environment.yml
    # COPY requirements.txt /tmp/requirements.txt
    RUN conda env create -f /tmp/environment.yml
    
    
    # Get the environment name out of the environment.yml (its first line)
    RUN echo "source activate $(head -1 /tmp/environment.yml | cut -d' ' -f2)" > ~/.bashrc
    # ENV does not run shell command substitution, so the environment name is
    # written out here; it must match the name in environment.yml
    ENV PATH=/opt/conda/envs/databricks-ml-8.2/bin:$PATH
    

    Then build the image and run a Docker container from it. Don't forget to activate the environment created in that container.
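
Building the image and starting a container could look roughly like the sketch below. It only assembles and prints the two Docker CLI commands; the image tag `databricks-ml-8.2-local`, the helper name `docker_commands`, and the assumption that the Dockerfile sits in the current directory are all hypothetical choices, not part of the original setup.

```python
import shlex


def docker_commands(image_tag, context_dir="."):
    """Return the (build, run) Docker CLI commands as argument lists."""
    # Build the image from the Dockerfile found in context_dir.
    build = ["docker", "build", "-t", image_tag, context_dir]
    # Start an interactive shell in a throwaway container of that image.
    run = ["docker", "run", "-it", "--rm", image_tag, "bash"]
    return build, run


if __name__ == "__main__":
    for cmd in docker_commands("databricks-ml-8.2-local"):
        # Printed rather than executed; to run a command, pass the list
        # to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Inside the container, `conda env list` shows whether the environment was created.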

    Hope this helps anyone with the same issue :)