Tags: apache-spark, intellij-idea, databricks, azure-databricks

Why "databricks-connect test" does not work after configurate Databricks Connect?


I want to run my Spark processes directly on my cluster from IntelliJ IDEA, so I'm following this documentation: https://docs.azuredatabricks.net/user-guide/dev-tools/db-connect.html

After configuring everything, I run databricks-connect test, but I don't get the Scala REPL that the documentation says I should see.


(Screenshot of my cluster configuration.)


Solution

  • I solved the problem. The problem was the versions of all the tools:

    • Install Java

    Download and install the Java SE Runtime Environment 8.

    Download and install the Java SE Development Kit 8.
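
    To confirm that Java 8 is the version that actually ends up on your PATH, you can run:

    # Should report a 1.8.x version
    java -version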

    • Install Conda

    You can either download and install the full-blown Anaconda distribution or use Miniconda.
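
    Either way, you can sanity-check the install from a new prompt:

    # Prints the installed Conda version
    conda --version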

    • Download WinUtils

    This pesky bugger is part of Hadoop and is required by Spark to work on Windows. For a quick install, open PowerShell (as an admin) and run the following (if you are on a corporate network with funky security you may need to download the exe manually):

    # Create the folder for the Hadoop binaries
    New-Item -Path "C:\Hadoop\Bin" -ItemType Directory -Force
    # Download winutils.exe from the Hadoop 2.7.1 build
    Invoke-WebRequest -Uri https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe -OutFile "C:\Hadoop\Bin\winutils.exe"
    # Point HADOOP_HOME at it, machine-wide
    [Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\Hadoop", "Machine")
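
    The Machine-scoped variable only shows up in new shells, so open a fresh PowerShell window and verify the setup:

    # Should print C:\Hadoop
    [Environment]::GetEnvironmentVariable("HADOOP_HOME", "Machine")
    # Should list winutils.exe
    Get-ChildItem "C:\Hadoop\Bin"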
    
    • Create Virtual Environment

    We will now create a new virtual environment. I recommend creating one environment per project you are working on. This allows us to install different versions of Databricks-Connect per project and upgrade them separately.

    From the Start menu find the Anaconda Prompt. When it opens it will have a default prompt of something like:

    (base) C:\Users\User

    The (base) part means you are not in a virtual environment, but rather the base install. To create a new environment, execute this:

    conda create --name dbconnect python=3.5
    

    Where dbconnect is the name of your environment and can be whatever you want. Databricks currently runs Python 3.5, so your Python version must match. Again, this is another good reason for having an environment per project, as this may change in the future.
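
    You can confirm the environment was created by listing all Conda environments:

    conda env list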

    • Now activate the environment:

      conda activate dbconnect
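
      The prompt should now start with (dbconnect). As a quick check that the environment picked up the right interpreter:

      # Must report a 3.5.x version to match the cluster
      python --version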

    • Install Databricks-Connect

    You are now good to go:

    # The client's minor version must match your cluster's Databricks Runtime (5.3 here)
    pip install -U databricks-connect==5.3.*

    databricks-connect configure
    

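    When databricks-connect configure runs, it prompts for your workspace details. The exact prompts can vary by client version, and every value below is a placeholder: substitute your own workspace URL, a personal access token, and the cluster ID from your cluster's page URL (Org ID is Azure-only and stays 0 on AWS):

    Databricks Host: https://your-workspace.cloud.databricks.com
    Databricks Token: dapi0123456789abcdef
    Cluster ID: 0123-456789-abcde123
    Org ID: 0
    Port: 15001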

    • Create a Databricks cluster (in this case I used Amazon Web Services) and add the following to its Spark config:


    spark.databricks.service.server.enabled true
    spark.databricks.service.port 15001

    (Use port 15001 on Amazon and 8787 on Azure.)
    
    • Turn Windows Defender Firewall off or allow access.
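
    With everything in place, run the test again from the activated environment. As a quick sketch (the hostname below is a placeholder; use your own workspace URL, and port 8787 on Azure), you can first check that the service port is reachable through the firewall, then rerun the test:

    # Placeholder hostname - substitute your own workspace
    Test-NetConnection -ComputerName your-workspace.cloud.databricks.com -Port 15001
    # Should now run the connectivity checks and start the Scala REPL
    databricks-connect test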