
Switch between workspaces with databricks-connect


Is it possible to switch workspace with the use of databricks-connect?

I'm currently trying to switch with: spark.conf.set('spark.driver.host', cluster_config['host'])

But this gives back the following error: AnalysisException: Cannot modify the value of a Spark config: spark.driver.host


Solution

  • If you look at the documentation on configuring the client, you will see that there are three methods to configure Databricks Connect:

    • Configuration file generated with databricks-connect configure - the file name is always ~/.databricks-connect,
    • Environment variables - DATABRICKS_ADDRESS, DATABRICKS_API_TOKEN, ...
    • Spark configuration properties - spark.databricks.service.address, spark.databricks.service.token, ... But with this method, the Spark session may already be initialized, so you may not be able to switch on the fly without restarting Spark.

    But if you use different DBR versions, then it's not enough to change the configuration properties; you also need to switch to a Python environment that contains the corresponding version of the Databricks Connect distribution.
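    The environment-variable method above can be scripted per workspace. A minimal sketch (the host, token, and cluster ID values are placeholders, not real credentials):

    ```shell
    # Point databricks-connect at a specific workspace via environment
    # variables; these override the ~/.databricks-connect file.
    # All values below are placeholders.
    export DATABRICKS_ADDRESS="https://my-workspace.cloud.databricks.com"
    export DATABRICKS_API_TOKEN="dapi0123456789abcdef"
    export DATABRICKS_CLUSTER_ID="0123-456789-abcde000"
    ```

    Because these are plain environment variables, switching workspaces is just a matter of re-exporting them in the current shell, which sidesteps the read-only spark.driver.host error from the question.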

    For my own work I wrote the following Zsh script that allows easy switching between different setups (shards) - note that it allows only one shard to be used at a time. Prerequisites are:

    • A Python environment is created with the name <name>-shard
    • databricks-connect is installed into the activated pyenv environment with:
    pyenv activate field-eng-shard
    pip install -U databricks-connect==<DBR-version>
    
    • databricks-connect is configured once, and the configuration for a specific cluster/shard is stored in the ~/.databricks-connect-<name> file that will be symlinked to ~/.databricks-connect
    function use-shard() {
        SHARD_NAME="$1"
        if [ -z "$SHARD_NAME" ]; then
            echo "Usage: use-shard shard-name"
            return 1
        fi
        # Refuse to touch a real file - only replace symlinks created by this function
        if [ ! -L ~/.databricks-connect ] && [ -f ~/.databricks-connect ]; then
            echo "There is a ~/.databricks-connect file - possibly you configured another shard"
            return 1
        elif [ -f ~/.databricks-connect-${SHARD_NAME} ]; then
            # Point ~/.databricks-connect at the chosen shard's configuration
            rm -f ~/.databricks-connect
            ln -s ~/.databricks-connect-${SHARD_NAME} ~/.databricks-connect
            # Switch to the Python environment holding the matching databricks-connect version
            pyenv deactivate
            pyenv activate ${SHARD_NAME}-shard
        else
            echo "There is no configuration file for shard: ~/.databricks-connect-${SHARD_NAME}"
            return 1
        fi
    }
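    The core of the function is the symlink swap; the snippet below demonstrates that mechanism in isolation, using a scratch directory instead of $HOME and a hypothetical shard name "demo":

    ```shell
    # Demonstrate the symlink mechanism use-shard relies on,
    # in a temporary directory (shard name "demo" is hypothetical).
    tmp=$(mktemp -d)
    echo '{"host": "https://demo.cloud.databricks.com"}' > "$tmp/.databricks-connect-demo"

    # What use-shard does under the hood: replace the symlink.
    rm -f "$tmp/.databricks-connect"
    ln -s "$tmp/.databricks-connect-demo" "$tmp/.databricks-connect"

    # Resolve the symlink to confirm which shard is active.
    readlink "$tmp/.databricks-connect"
    ```

    The same readlink check against the real ~/.databricks-connect is a quick way to see which shard is currently selected.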