I'm migrating the current Hive metastore tables in my Azure Databricks workspace to Unity Catalog (UC), and I encountered an issue with running git clone into a Volume.
Earlier, on my non-UC-enabled cluster, I had a cell in the notebook like the following to git clone my repo into a DBFS tmp location:
!git clone https://$MyGitPAT@dev.azure.com/repo_path /tmp/repo
But now, with my UC-enabled cluster, I want to clone the repo inside a Volume so I can remove the repo directory at the beginning of the notebook (dbutils.fs.rm("/Volumes/catalogname/schemaname/volumename/tmp/repo", True), which works), like the following:
!git clone https://$MyGitPAT@dev.azure.com/repo_path /Volumes/catalogname/schemaname/volumename/tmp/repo
But the clone appears to get stuck at the "Resolving deltas" step.
Has anyone faced this issue and found a solution? I'm thinking maybe the git clone has to be done differently now; my last option would be to put the git clone command in an init script and have the UC-enabled cluster run it when the cluster starts up.
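A likely cause of the hang, for anyone else hitting this: git makes many small random writes, which the FUSE mount backing UC Volumes handles poorly, so cloning straight into a /Volumes path can stall. A commonly suggested alternative is to clone to the driver's local disk first and then copy the working tree into the Volume. A minimal sketch — the function name and paths are illustrative, not from the original post:

```shell
# Sketch: clone to the driver's local disk, then copy into the Volume.
clone_via_local_disk() {
  local repo_url="$1" volume_dest="$2"
  local scratch
  scratch="$(mktemp -d)"
  # Local disk handles git's many small writes; the FUSE mount may not.
  git clone "$repo_url" "$scratch/repo"
  mkdir -p "$volume_dest"
  # Copy everything, including dotfiles such as .git
  cp -r "$scratch/repo/." "$volume_dest/"
  rm -rf "$scratch"
}

# e.g. clone_via_local_disk "https://$MyGitPAT@dev.azure.com/repo_path" \
#        /Volumes/catalogname/schemaname/volumename/tmp/repo
```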
Found a workaround which solves the issue initially posted. I modified a CI/CD Azure DevOps pipeline I already had running, which in my case runs on the same repository I need to clone, but it can also clone external repositories.
First, I added a new task to the build stage that copies the repository into a directory, so the task after it publishes that directory as an artifact:
- script: | # Copy git repo to tmp repo directory
    mkdir -p $(Build.ArtifactStagingDirectory)/repo
    find $(Build.SourcesDirectory) -mindepth 1 -maxdepth 1 -exec cp -r {} $(Build.ArtifactStagingDirectory)/repo \;
  displayName: Copy repo
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)'
    ArtifactName: 'my-artifact'
  displayName: Publish Artifact
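For context on the copy step: the find invocation copies every top-level entry of the source tree, including dotfiles such as .git, which a plain cp -r $(Build.SourcesDirectory)/* would miss. A quick local dry-run of that step, with temp directories standing in for the pipeline variables:

```shell
# Stand-ins for $(Build.SourcesDirectory) and $(Build.ArtifactStagingDirectory)
SRC="$(mktemp -d)"
STAGE="$(mktemp -d)"
mkdir -p "$SRC/.git" "$SRC/notebooks"
touch "$SRC/README.md" "$SRC/.git/config"

# Same copy step as the pipeline task above
mkdir -p "$STAGE/repo"
find "$SRC" -mindepth 1 -maxdepth 1 -exec cp -r {} "$STAGE/repo" \;

ls -A "$STAGE/repo"   # shows .git, README.md and notebooks
```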
Then, the second part: in the deploy stage (you need a download-artifact step too) I added an AzureFileCopy@5 task that copies that directory (i.e. my repository) into my ADLS (Azure Data Lake Storage) location, which is the same location my Databricks UC Volume points to, so the repository shows up in the UC Volume, like the following:
- task: DownloadBuildArtifacts@1
  inputs:
    artifactName: my-artifact
    downloadPath: '$(System.ArtifactsDirectory)'
  displayName: Download Build Artifact
- task: AzureFileCopy@5
  displayName: Copy repo to storage account
  inputs:
    SourcePath: $(System.ArtifactsDirectory)/my-artifact/repo
    azureSubscription: YourAzureSubscriptionName
    Destination: AzureBlob
    storage: YourADLSName
    ContainerName: YourADLSContainerName
    BlobPrefix: YourUCVolumeName/tmp
    AdditionalArgumentsForBlobCopy: |
      --recursive=true `
      --overwrite=true