Tags: python, linux, bash, cluster-computing, azure-databricks

Install Python packages using init scripts in a Databricks cluster


I installed the Databricks CLI by running the following command:

pip install databricks-cli

Use the version of pip that matches your Python installation; if you are using Python 3, run pip3.

Then, after creating a PAT (personal access token) in Databricks, I ran the following Bash script:

#!/bin/bash
# You can run this on Windows as well; just convert it to a batch file.
# Note: the Databricks CLI must be installed and a token must be configured.
echo "Creating DBFS directory"
dbfs mkdirs dbfs:/databricks/packages

echo "Uploading cluster init script"
dbfs cp --overwrite python_dependencies.sh dbfs:/databricks/packages/python_dependencies.sh

echo "Listing DBFS directory"
dbfs ls dbfs:/databricks/packages

The python_dependencies.sh script:

#!/bin/bash
# Restart cluster after running.

sudo apt-get install applicationinsights=0.11.9 -V -y
sudo apt-get install azure-servicebus=0.50.2 -V -y
sudo apt-get install azure-storage-file-datalake=12.0.0 -V -y
sudo apt-get install humanfriendly=8.2 -V -y
sudo apt-get install mlflow=1.8.0 -V -y
sudo apt-get install numpy=1.18.3 -V -y
sudo apt-get install opencensus-ext-azure=1.0.2 -V -y
sudo apt-get install packaging=20.4 -V -y
sudo apt-get install pandas=1.0.3 -V -y
sudo apt update
sudo apt-get install scikit-learn=0.22.2.post1 -V -y
status=$?
echo "The date command exit status : ${status}"

I registered the above script as a cluster init script so that the Python libraries are installed when the cluster starts.

My problem is that, although everything seems fine and the cluster starts successfully, the libraries are not installed properly. When I open the Libraries tab of the cluster, only 1 of the 10 Python libraries is shown as installed.

I would appreciate your help and comments.


Solution

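  • I found the solution based on a comment from @RedCricket: these are Python packages, so they have to be installed with pip rather than apt-get, and pip pins a version with == rather than =.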

    #!/bin/bash
    
    pip install applicationinsights==0.11.9
    pip install azure-servicebus==0.50.2
    pip install azure-storage-file-datalake==12.0.0
    pip install humanfriendly==8.2
    pip install mlflow==1.8.0
    pip install numpy==1.18.3
    pip install opencensus-ext-azure==1.0.2
    pip install packaging==20.4
    pip install pandas==1.0.3
    pip install --upgrade scikit-learn==0.22.2.post1
    

    The above .sh file installs all the listed Python dependencies each time the cluster starts, so the libraries do not have to be reinstalled when a notebook is re-executed.
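
As a possible variation on the script above, the pins can be kept in a single requirements file instead of one pip call per package. The sketch below is only an illustration, not part of the original answer: it takes the apt-style name=version pins from the question and rewrites them into pip's name==version form. The REQ_FILE variable and the use of mktemp for its location are assumptions made for the example.

```shell
#!/bin/bash
# Stop at the first failing command instead of carrying on silently.
set -euo pipefail

# Hypothetical output location; any writable path would do.
REQ_FILE="$(mktemp)"

# The apt-style pins from the question (name=version), one per line.
# sed rewrites only the first "=" on each line into pip's "==" operator.
sed 's/=/==/' > "$REQ_FILE" <<'EOF'
applicationinsights=0.11.9
azure-servicebus=0.50.2
azure-storage-file-datalake=12.0.0
humanfriendly=8.2
mlflow=1.8.0
numpy=1.18.3
opencensus-ext-azure=1.0.2
packaging=20.4
pandas=1.0.3
scikit-learn=0.22.2.post1
EOF

# Show the resulting pip-style requirements.
cat "$REQ_FILE"
```

An init script could then install everything with one call, pip install -r "$REQ_FILE", using whichever pip belongs to the cluster's Python environment. With set -e in place, a failing step makes the script exit with a non-zero status rather than being masked by a later echo, which should make an incomplete installation easier to notice.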