
DVC import: authenticating to Azure Blob Storage


I'm using DVC to track and version data that is stored locally on the file system and in Azure Blob storage.

My setup is as follows:

  • DataProject1 uses a local file location as a remote, so it does not require any authentication.

  • DataProject2 uses Azure Blob Storage as a remote and authenticates with a sas_token. I can push and pull data to and from the remote when I am working inside this project.

  • MLProject uses dvc import to import data from DataProject1 and DataProject2.

When I run the import command against DataProject1, it succeeds:

dvc import -o 'data/project1' 'https://company.visualstudio.com/DefaultCollection/proj/_git/DataProject1' 'data/project1'

However, when I run a similar command against DataProject2:

dvc import -o 'data/project2' 'https://company.visualstudio.com/DefaultCollection/proj/_git/DataProject2' 'data/project2'

it fails with:

ERROR: unexpected error - Operation returned an invalid status 'This request is not authorized to perform this operation using this permission.' ErrorCode:AuthorizationPermissionMismatch.

I would like to configure dvc import so that it uses the required sas_token, but I cannot find a way to do that.


Solution

  • This happens because DVC does not use MLProject's config when it clones DataProject2 and runs dvc fetch there during the import. It therefore does not know where to find the token (which is, of course, not stored in the Git repo).

    There are two ways to supply it: a global or system config, or environment variables.

    To implement the first option:

    On the machine where you run dvc import, create a remote with the same name in the --global or --system config and specify the token there. The global config fields are merged with the config in the DataProject2 repo when DVC pulls the data to import.

    dvc remote add --global <DataProject2-remote-name> azure://DataProject2/storage
    dvc remote modify --global <DataProject2-remote-name> account_name <name>
    dvc remote modify --global <DataProject2-remote-name> sas_token <token>
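
The global remote only takes effect if its name matches the one DataProject2 declares. Running dvc remote list inside DataProject2 prints the names. As a minimal sketch, the same lookup can be done by reading the repo's .dvc/config, assuming remotes are declared there under the usual ['remote "<name>"'] section header. The helper remote_names and the sample remote name storage are hypothetical.

```python
import re

# Matches DVC remote section headers such as: ['remote "storage"']
# (assumed config layout; adjust if your .dvc/config differs)
REMOTE_SECTION = re.compile(r"""^\s*\['remote\s+"([^"]+)"'\]""", re.MULTILINE)


def remote_names(config_text):
    """Return the remote names declared in the text of a .dvc/config file."""
    return REMOTE_SECTION.findall(config_text)


# Hypothetical contents of DataProject2/.dvc/config, for illustration only.
SAMPLE_CONFIG = """[core]
    remote = storage
['remote "storage"']
    url = azure://DataProject2/storage
"""
```

Calling remote_names(SAMPLE_CONFIG) gives the name to substitute for <DataProject2-remote-name> in the commands above.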
    

    For the second option, set these environment variables before running dvc import:

    export AZURE_STORAGE_SAS_TOKEN='mysecret'
    export AZURE_STORAGE_ACCOUNT='myaccount'
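
If you would rather not export the token into your whole shell session, the variables can be passed to the dvc import process alone. The following is a rough sketch of that idea, not something DVC provides: build_import and run_import are hypothetical helper names, and the sketch assumes dvc is on the PATH.

```python
import os
import subprocess


def build_import(url, path, out, account, sas_token, base_env=None):
    """Return (command, environment) for a dvc import with Azure credentials."""
    # Copy the parent environment so PATH etc. are kept, then add the
    # two variables from the answer above for the child process only.
    env = dict(os.environ if base_env is None else base_env)
    env["AZURE_STORAGE_ACCOUNT"] = account
    env["AZURE_STORAGE_SAS_TOKEN"] = sas_token
    cmd = ["dvc", "import", "-o", out, url, path]
    return cmd, env


def run_import(url, path, out, account, sas_token):
    """Run dvc import; raises CalledProcessError if dvc exits non-zero."""
    cmd, env = build_import(url, path, out, account, sas_token)
    subprocess.run(cmd, env=env, check=True)
```

For the case in the question, run_import would be called with the DataProject2 Git URL, 'data/project2' for both path and out, and the account name and token read from wherever you keep secrets.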
    

    Please give it a try and let me know whether it works.