Tags: git, azure-devops, databricks, azure-databricks, databricks-unity-catalog

Git clone to a Databricks Unity Catalog enabled Volume


I'm migrating the current Hive metastore tables in my Azure Databricks workspace to Unity Catalog (UC), and I encountered an issue related to running git clone into a Volume.

My cluster configuration is something like:

  • DBR 13.3 LTS
  • Mode: Shared (UC enabled)

Earlier, on my non-UC-enabled cluster, I had a cell in the notebook like the following to git clone my repo to a DBFS tmp location:

!git clone https://[email protected]/repo_path /tmp/repo

But now that my cluster is UC-enabled, I want to clone the repo inside a Volume so I can remove the repo directory at the beginning of the notebook (dbutils.fs.rm("/Volumes/catalogname/schemaname/volumename/tmp/repo", True), which works), like the following:

!git clone https://[email protected]/repo_path /Volumes/catalogname/schemaname/volumename/tmp/repo

But the clone appears to get stuck at the "Resolving deltas" step.

Has anyone faced this issue and found a solution? I'm thinking maybe the git clone has to be done differently now; my last option would be to include the git clone command in an init script and have the UC-enabled cluster run it at startup.
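For reference, the init-script route mentioned above could look something like this. This is only a sketch: the repo URL and Volume path are the placeholders from the question, the script name is made up, and it would still need to be registered as a cluster-scoped init script:

```shell
# Generate a cluster init script that re-clones the repo at every cluster start.
# URL and target path are placeholders; a real script needs a valid token/URL.
cat > /tmp/clone_repo_init.sh <<'EOF'
#!/bin/bash
set -euo pipefail
TARGET=/Volumes/catalogname/schemaname/volumename/tmp/repo
rm -rf "$TARGET"    # start from a clean directory, mirroring the dbutils.fs.rm cell
git clone https://[email protected]/repo_path "$TARGET"
EOF
chmod +x /tmp/clone_repo_init.sh
```

Whether the clone itself still hangs inside an init script is an open question, since the write path to the Volume is the same.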


Solution

  • Found a workaround which solves the issue initially posted. I modified an Azure DevOps CI/CD pipeline I already had running, which in my case runs on the same repository I need to clone but can also clone external repositories.

    First, I added a new task during the build stage that copies the repository into a directory, so the task after it publishes that directory as an artifact:

    - script: | # Copy git repo to tmp repo directory
        mkdir -p $(Build.ArtifactStagingDirectory)/repo
        find $(Build.SourcesDirectory) -mindepth 1 -maxdepth 1 -exec cp -r {} $(Build.ArtifactStagingDirectory)/repo \;
      displayName: Copy repo         
    - task: PublishBuildArtifacts@1
      inputs:
        PathtoPublish: '$(Build.ArtifactStagingDirectory)'
        ArtifactName: 'my-artifact'
      displayName: Publish Artifact
    
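    As an aside, the find -mindepth 1 -maxdepth 1 -exec cp -r idiom above copies every top-level entry of the source directory, including hidden ones like .git, which a plain cp -r src/* would silently skip. A throwaway local illustration (directory names are made up):

    ```shell
    set -e
    src=$(mktemp -d); dst=$(mktemp -d)
    mkdir -p "$src/.git" "$src/code"
    echo readme > "$src/README.md"
    # Copy each top-level entry of $src into $dst, hidden entries included
    find "$src" -mindepth 1 -maxdepth 1 -exec cp -r {} "$dst" \;
    ls -A "$dst"    # .git, README.md and code all arrive
    ```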

    Then, the second part: in the deploy stage (you also need a download-artifact step) I added an AzureFileCopy@5 task which copies that directory (i.e. my repository) into my ADLS (Azure Data Lake Storage) location, which is the same location my Databricks UC Volume points to, so I can see my repository in the UC Volume, like the following:

    - task: DownloadBuildArtifacts@1
      inputs:
        artifactName: my-artifact
        downloadPath: '$(System.ArtifactsDirectory)'
      displayName: Download Build Artifact
    - task: AzureFileCopy@5
      displayName: Copy repo to storage account
      inputs:
        SourcePath: $(System.ArtifactsDirectory)/my-artifact/repo
        azureSubscription: YourAzureSubscriptionName
        Destination: AzureBlob
        storage: YourADLSName
        ContainerName: YourADLSContainerName
        BlobPrefix: YourUCVolumeName/tmp
        AdditionalArgumentsForBlobCopy: |
          --recursive=true `
          --overwrite=true
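
    Once the pipeline has run, a quick sanity check from a notebook shell cell on the UC-enabled cluster can confirm the repository landed in the Volume. The path is the placeholder used throughout, and the repo_present helper name is made up:

    ```shell
    # repo_present: succeeds when the directory exists and is non-empty
    repo_present() { [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; }

    repo=/Volumes/catalogname/schemaname/volumename/tmp/repo   # placeholder path
    if repo_present "$repo"; then
      echo "repo found in Volume"
    else
      echo "repo not found at $repo"
    fi
    ```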