python amazon-s3 dvc

"SCM error" when using dvc.api.get_url() to access S3 remote repository


I have a remote repository that I want to use with DVC. I want to access my files through DVC in Python using the dvc.api module. Here's the code I'm using:

import dvc.api

path = 'data/test.csv'
repo = 's3://xxx/DVC_test/'
version = 'v1'

data_url = dvc.api.get_url(path=path, repo=repo, rev=version)

However, I'm encountering the following error:

Cloning |                                                          |0.00/? [00:00,      ?obj   
                                                                                               
Cloning |                                                          |0.00/? [00:00,             
                                                                                               
Traceback (most recent call last):   
  File "<input>", line 1, in <module>                                                          
    data_url = dvc.api.get_url(path=path, repo=repo, rev=version)                              
  File "/home/asokolov/Documents/BG/DVC_pipeline/dvc_test_venv/lib/python3.9/site-packages/dvc/api/data.py", line 21, in get_url
    with Repo.open(repo, rev=rev, subrepos=True, uninitialized=True) as _repo:
  File "/usr/local/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/home/asokolov/Documents/BG/DVC_pipeline/dvc_test_venv/lib/python3.9/site-packages/dvc/external_repo.py", line 45, in external_repo
    path = _cached_clone(url, rev, for_write=for_write)
  File "/home/asokolov/Documents/BG/DVC_pipeline/dvc_test_venv/lib/python3.9/site-packages/dvc/external_repo.py", line 173, in _cached_clone
    clone_path, shallow = _clone_default_branch(url, rev, for_write=for_write)
  File "/home/asokolov/Documents/BG/DVC_pipeline/dvc_test_venv/lib/python3.9/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/asokolov/Documents/BG/DVC_pipeline/dvc_test_venv/lib/python3.9/site-packages/funcy/flow.py", line 274, in wrap_with
    return call()
  File "/home/asokolov/Documents/BG/DVC_pipeline/dvc_test_venv/lib/python3.9/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/asokolov/Documents/BG/DVC_pipeline/dvc_test_venv/lib/python3.9/site-packages/dvc/external_repo.py", line 241, in _clone_default_branch
    git = clone(url, clone_path)
  File "/home/asokolov/Documents/BG/DVC_pipeline/dvc_test_venv/lib/python3.9/site-packages/dvc/scm.py", line 165, in clone
    raise CloneError("SCM error") from exc
dvc.scm.CloneError: SCM error

At the same time, running dvc pull works without errors.

Here's the output of dvc doctor:

dvc doctor
DVC version: 2.47.0 (pip)
-------------------------
Platform: Python 3.9.16 on Linux-5.19.0-31-generic-x86_64-with-glibc2.36
Subprojects:
        dvc_data = 0.42.1
        dvc_objects = 0.21.1
        dvc_render = 0.2.0
        dvc_task = 0.2.0
        scmrepo = 0.1.15
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.3.0, boto3 = 1.24.59)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git

And my .dvc/config.local:

['remote "dvc-remote"']
    url = s3://xxx/DVC_test/
    access_key_id = xxx
    secret_access_key = xxx
    region = xxx

Could you please suggest a solution to resolve the issue?


Solution

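  • I believe you are slightly misusing the Python API; see https://dvc.org/doc/api-reference/get_url. The repo argument must point to the DVC project itself (a Git repository URL or a local path), not to the S3 storage URL. DVC tries to clone whatever you pass as repo with Git, which is why an s3:// URL ends in CloneError. The S3 storage is selected by its remote name through the remote argument.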

    It looks like you would want something like this:

    import dvc.api
    
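    path = "data/test.csv"
    remote_name = "dvc-remote"  # name of the remote from .dvc/config.local
    repo = "https://github.com/username/repo.git"  # Git repo holding the DVC project (placeholder URL)
    version = "v1"  # Git revision: tag, branch, or commit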
    
    url = dvc.api.get_url(
        path=path,
        remote=remote_name,
        repo=repo,
        rev=version
    )
    
    print(url)
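
To catch this mistake before DVC attempts a Git clone, a small guard can reject storage URLs passed as repo. This is only a sketch: looks_like_storage_url, safe_get_url, and the STORAGE_SCHEMES set are hypothetical names introduced here and are not part of DVC, and the list of schemes is illustrative rather than exhaustive. Only dvc.api.get_url itself comes from the library.

```python
from urllib.parse import urlparse

# Schemes that typically denote DVC storage, not a Git repository.
# Assumption: illustrative list, not exhaustive.
STORAGE_SCHEMES = {"s3", "gs", "azure", "hdfs"}


def looks_like_storage_url(repo):
    """Return True if `repo` looks like a storage URL rather than a Git repo."""
    if repo is None:
        # None means "use the current DVC project", which is valid.
        return False
    return urlparse(repo).scheme.lower() in STORAGE_SCHEMES


def safe_get_url(path, repo=None, rev=None, remote=None):
    """Hypothetical wrapper around dvc.api.get_url with an early sanity check."""
    if looks_like_storage_url(repo):
        raise ValueError(
            f"repo={repo!r} is a storage URL; pass the Git repository URL or a "
            "local project path as repo, and the remote name as remote."
        )
    # Imported lazily so the check above works even without DVC installed.
    import dvc.api

    return dvc.api.get_url(path=path, repo=repo, rev=rev, remote=remote)
```

Since dvc pull already works for you, the project is presumably checked out locally. In that case repo can be left out (or set to the local project path) when the script runs inside that project, and only remote="dvc-remote" and rev need to be passed.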