I have a python_wheel_task in one of my asset bundle jobs. It executes the whl file built from the local repository I deploy the bundle from, and this works fine on its own.
However, I need to add a custom dependency whl file (from another repo, packaged and published to my Azure Artifacts feed) to the task as a library so that my local repo's whl file works completely.
I tried to define it as follows:
- task_key: some_task
  job_cluster_key: job_cluster
  python_wheel_task:
    package_name: my_local_package_name
    entry_point: my_entrypoint
    named_parameters: { "env": "dev" }
  libraries:
    - pypi:
        package: custom_package==3.0.1
        repo: https://pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/
    - whl: ../../dist/*.whl # my local repo's whl: being built as part of the asset-bundle
When I deploy and run the bundle, I get the following error in the job cluster:
24/07/12 07:49:01 ERROR Utils:
Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh
/local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install 'custom_package==3.0.1'
--index-url https://pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/
--disable-pip-version-check) exited with code 1, and Looking in indexes:
https://pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/
24/07/12 07:49:01 INFO SharedDriverContext: Failed to attach library
python-pypi;custom_package;;3.0.1;https://pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/
to Spark
I suppose I need to configure a personal access token / authentication for the feed somewhere, but I cannot find anything in the Databricks documentation about library dependencies. There is only one sentence about adding a custom index and nothing about authentication.
How can I get this to work?
Best practice solution for existing all-purpose clusters
I managed to use a combination of an existing cluster, a cluster environment variable and init script to configure the cluster for authentication against a custom PyPI index:
I stored an Azure DevOps PAT in my KeyVault
I created a secret-scope in Databricks for that KeyVault
I uploaded/imported the init-script in Databricks to Workspace/Shared/init-scripts/set-private-artifact-feed.sh
I created an all-purpose cluster and set the following under Configuration -> Advanced options:
Environment variable: PYPI_TOKEN={{secrets/<my-scope>/<secret-name-of-devops-pat>}}
Init Scripts: Type Workspace, File path /Shared/init-scripts/set-private-artifact-feed.sh
Contents of set-private-artifact-feed.sh:
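#!/bin/bash
# Only configure pip when the PAT was injected through the cluster environment variable.
# The token is deliberately not echoed, so it does not end up in the init script logs.
if [[ -n "$PYPI_TOKEN" ]]; then
  # Add the private feed as an extra index; the token is passed as the URL credential.
  printf "[global]\n" > /etc/pip.conf
  printf "extra-index-url =\n" >> /etc/pip.conf
  printf "\thttps://%s@pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/\n" "$PYPI_TOKEN" >> /etc/pip.conf
fi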
After restarting the cluster, I could run my task as initially defined; authentication against the index now works. More details in this Medium article.
Note that this does not work with a job cluster unless you also pass along the reference to the init script & set the environment variable! Using an all-purpose cluster makes more sense to me.
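For reference, passing the init script and the environment variable to a job cluster would look roughly like the fragment below. This is an untested sketch: the job name my_job, the Spark version, and the node type are placeholder assumptions, while spark_env_vars and init_scripts are the standard cluster spec fields. The scope, secret name, and script path are the ones used above.

```yaml
# Hypothetical bundle fragment; job name and cluster sizing values are placeholders.
resources:
  jobs:
    my_job:
      name: my_job
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12   # placeholder
            node_type_id: Standard_DS3_v2     # placeholder
            num_workers: 1
            # Same environment variable as on the all-purpose cluster
            spark_env_vars:
              PYPI_TOKEN: "{{secrets/<my-scope>/<secret-name-of-devops-pat>}}"
            # Same init script as on the all-purpose cluster
            init_scripts:
              - workspace:
                  destination: /Shared/init-scripts/set-private-artifact-feed.sh
```

The task would keep referencing the cluster through job_cluster_key: job_cluster, as in the task definition from the question.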
Hacky solution for job-clusters
We can add the PYPI token to the repo URL. I was unable to set the init-scripts / env-variables for the job-clusters properly to get it to work otherwise.
- pypi:
    package: pyspark-framework==4.0.0
    repo: https://<YOUR-TOKEN-HERE>@pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/
- whl: ../dist/*.whl
This hacky solution is a major security risk: the token will show up as plain text in your Databricks workspace for anyone to see!
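One way to at least keep the token out of the repository is to pass it in as a bundle variable at deploy time. The fragment below is a minimal sketch, assuming a hypothetical variable named pypi_token. The resolved URL still ends up in the deployed job definition, so the warning above applies unchanged.

```yaml
# databricks.yml (fragment); `pypi_token` is a hypothetical variable name
variables:
  pypi_token:
    description: Azure DevOps PAT with read access to the feed

# In the task definition: the variable is substituted when the bundle is deployed
libraries:
  - pypi:
      package: pyspark-framework==4.0.0
      repo: https://${var.pypi_token}@pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/
  - whl: ../dist/*.whl
```

The value can then be supplied at deploy time, for example through the BUNDLE_VAR_pypi_token environment variable or the --var option of databricks bundle deploy, instead of being committed to the repository.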