Unable to install Dask in Azure Databricks


I am trying to install Dask in Azure Databricks, following this documentation: https://github.com/dask-contrib/dask-databricks

First, I created the init script and added it to the cluster. Once the cluster is running, the Event log shows the message "Finished init scripts execution.":

{
  "init_scripts": {
    "reported_for_node": "0000-1111111-xxxxxxxx_10_139_64_16",
    "global": [],
    "cluster": [
      {
        "workspace": {
          "destination": "/Users/XXXXXXXXXX/dask-init.sh"
        },
        "status": "SUCCEEDED",
        "execution_duration_seconds": 37
      }
    ]
  }
} 

After that, I try to run anything from a notebook, but the session apparently never starts, and I get the following error:

Failure starting repl. Try detaching and re-attaching the notebook.

    at com.databricks.spark.chauffeur.ExecContextState.processInternalMessage(ExecContextState.scala:346)

I have also tried detaching and re-attaching, but it never works. Any advice, or any other way to install Dask in Azure Databricks?

Thanks in advance.

EDIT: RESOLVED

As @tom schimoler suggested, I followed these steps to resolve the issue:

- In my init script, I pinned the NumPy version to 1.24.2:

#!/bin/bash

# Pin numpy to 1.24.2 so pip does not pull in numpy 2.x
/databricks/python/bin/pip install numpy==1.24.2

# Install Dask + Dask Databricks (quoted so the shell does not glob the brackets)
/databricks/python/bin/pip install --upgrade "dask[complete]" dask-databricks

# Start Dask cluster components
dask databricks run

- I used a Runtime 15.4 LTS ML cluster.


Solution

  • I just ran into this same issue, even though everything was working fine a few weeks ago. In my case, it turned out to be a library dependency issue.

    As it happens, the latest version of dask (2024.8.2, released on 2024-08-30) requires numpy >= 1.24, while I'm on runtime 15.4 LTS ML, which ships numpy 1.23. To satisfy that requirement, pip upgrades numpy to the newest release, 2.1, which breaks compatibility with the other libraries preinstalled on the runtime. So you can try pinning numpy==1.24 in your init script along with the dask install.

    Not sure this is what's happening in your case. If you can open a terminal on your cluster you should be able to verify what version of dask and numpy got installed.
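
To check which versions actually got installed, something like the following sketch could be run with the cluster's Python interpreter (for example /databricks/python/bin/python from the web terminal). It only uses the standard library. The helper names `installed_version`, `major_minor` and `numpy_status` are made up for illustration, and the thresholds simply encode the versions mentioned above (numpy >= 1.24 required by dask 2024.8.2, numpy 2.x suspected of breaking the runtime's other libraries), so treat them as assumptions rather than an official compatibility matrix.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version):
    """Extract the leading 'X.Y' of a version string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def numpy_status(version):
    """Classify a NumPy version against the constraints described in the answer.

    Assumption: dask 2024.8.2 needs numpy >= 1.24, and numpy 2.x is the
    release suspected of breaking the runtime's preinstalled libraries.
    """
    if version is None:
        return "not installed"
    mm = major_minor(version)
    if mm < (1, 24):
        return "too old for dask 2024.8.2 (needs >= 1.24)"
    if mm >= (2, 0):
        return "2.x, likely incompatible with the runtime's other libraries"
    return "ok"


if __name__ == "__main__":
    # Report what pip actually left on the cluster.
    for name in ("numpy", "dask", "dask-databricks"):
        print(name, installed_version(name))
    print("numpy:", numpy_status(installed_version("numpy")))
```

If the numpy line reports a 2.x version, the unpinned upgrade described above is the likely culprit, and pinning numpy in the init script is worth trying.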