Tags: apache-spark, intellij-idea, databricks, azure-databricks

Why "databricks-connect test" does not work after configurate Databricks Connect?


I want to run my Spark processes directly on my cluster from IntelliJ IDEA, so I'm following this documentation: https://docs.azuredatabricks.net/user-guide/dev-tools/db-connect.html

After configuring everything, I run databricks-connect test, but I don't get the Scala REPL that the documentation says I should see.


(Screenshot of my cluster configuration.)


Solution

  • I solved the problem. The problem was the versions of all the tools:

    • Install Java

    Download and install the Java SE Runtime Environment 8.

    Download and install the Java SE Development Kit 8.
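
    To confirm that Java 8 is the version that actually ends up on your PATH, you can run:

    # Should report a 1.8.x version
    java -version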

    • Install Conda

    You can either download and install the full-blown Anaconda distribution or use Miniconda.
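
    Either way, you can sanity-check the install from a new prompt:

    # Prints the installed Conda version
    conda --version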

    • Download WinUtils

    This pesky bugger is part of Hadoop and is required by Spark to work on Windows. For a quick install, open PowerShell (as an admin) and run the following (if you are on a corporate network with funky security you may need to download the exe manually):

    # Create the folder for the Hadoop binaries
    New-Item -Path "C:\Hadoop\Bin" -ItemType Directory -Force
    # Download winutils.exe from the Hadoop 2.7.1 build
    Invoke-WebRequest -Uri https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe -OutFile "C:\Hadoop\Bin\winutils.exe"
    # Point HADOOP_HOME at it, machine-wide
    [Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\Hadoop", "Machine")
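
    The Machine-scoped variable only shows up in new shells, so open a fresh PowerShell window and verify the setup:

    # Should print C:\Hadoop
    [Environment]::GetEnvironmentVariable("HADOOP_HOME", "Machine")
    # Should list winutils.exe
    Get-ChildItem "C:\Hadoop\Bin"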
    
    • Create Virtual Environment

    We will now create a new virtual environment. I recommend creating one environment per project you are working on. This allows us to install different versions of Databricks-Connect per project and upgrade them separately.

    From the Start menu find the Anaconda Prompt. When it opens it will have a default prompt of something like:

    (base) C:\Users\User

    The (base) part means you are not in a virtual environment, but rather the base install. To create a new environment, execute this:

    conda create --name dbconnect python=3.5
    

    Where dbconnect is the name of your environment and can be whatever you want. Databricks currently runs Python 3.5, so your Python version must match. Again, this is another good reason for having an environment per project, as this may change in the future.
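
    You can confirm the environment was created by listing all Conda environments:

    conda env list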

    • Now activate the environment:

      conda activate dbconnect
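
      The prompt should now start with (dbconnect). As a quick check that the environment picked up the right interpreter:

      # Must report a 3.5.x version to match the cluster
      python --version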

    • Install Databricks-Connect

    You are now good to go:

    # The client's minor version must match your cluster's Databricks Runtime (5.3 here)
    pip install -U databricks-connect==5.3.*

    databricks-connect configure
    

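    When databricks-connect configure runs, it prompts for your workspace details. The exact prompts can vary by client version, and every value below is a placeholder: substitute your own workspace URL, a personal access token, and the cluster ID from your cluster's page URL (Org ID is Azure-only and stays 0 on AWS):

    Databricks Host: https://your-workspace.cloud.databricks.com
    Databricks Token: dapi0123456789abcdef
    Cluster ID: 0123-456789-abcde123
    Org ID: 0
    Port: 15001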

    • Create a Databricks cluster (in this case I used Amazon Web Services) and add the following to its Spark config:


    spark.databricks.service.server.enabled true
    spark.databricks.service.port 15001

    (Use port 15001 on Amazon and 8787 on Azure.)
    
    • Turn Windows Defender Firewall off or allow access.
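
    With everything in place, run the test again from the activated environment. As a quick sketch (the hostname below is a placeholder; use your own workspace URL, and port 8787 on Azure), you can first check that the service port is reachable through the firewall, then rerun the test:

    # Placeholder hostname - substitute your own workspace
    Test-NetConnection -ComputerName your-workspace.cloud.databricks.com -Port 15001
    # Should now run the connectivity checks and start the Scala REPL
    databricks-connect test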