How to configure that specific files should go to different remotes with DVC?


I'd like to have these remotes (from .dvc/config):

['remote "test-data"']
    url = gs://some-test-bucket/dvc
['remote "prod-data"']
    url = gs://some-prod-bucket/dvc

(I have not set a default remote.)

And I have some test data in the folder ./test-data, and production data in the folder ./prod-data.

The .dvc file for prod data:

$ cat prod-data.dvc
outs:
- md5: 057682599b100f0240ca51b6256ed7d5.dir
  size: 135840994497
  nfiles: 17008
  hash: md5
  path: prod-data

Example of .dvc file for test data:

$ cat test-data/some_folder.dvc
outs:
- md5: c06520abe0140c72004dbe4494a78b23.dir
  size: 692847854
  nfiles: 8
  hash: md5
  path: some_folder

I want the command dvc pull -r prod-data to only give me the ./prod-data/ folder, but instead it's fetching more:

$ dvc pull -r prod-data
A       prod-data/
A       test-data/some_folder/
A       test-data/some_other_folder_entirely/
3 files added

How can I set this up so that the test files are stored in one remote, while the prod data is stored in another? Maybe I'm misunderstanding how DVC should be used?

Thanks!


Solution

  • In this setup, dvc pull -r prod-data and dvc push -r prod-data try to pull / push all data (every .dvc file) unless you explicitly specify a target, e.g. dvc pull -r test-data test-data/some_folder.

    To actually split the data across several remotes, you need to set the remote field on each output:

    outs:
    - md5: c06520abe0140c72004dbe4494a78b23.dir
      size: 692847854
      nfiles: 8
      hash: md5
      path: some_folder
      remote: test-data
    

    At the moment, I don't think it can be specified as part of the dvc add command that creates the .dvc files for you. You're expected to manage this field (and some others) manually.

    Once that's done, you also won't need to keep specifying -r on dvc pull / dvc push; DVC will pick the remote automatically.

    Please give it a try and let me know if you hit any issues.
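
If there are many .dvc files, the manual edit described above (adding a remote field to each output) could be scripted. The following is only a minimal sketch, not part of DVC: the helper name add_remote_field is hypothetical, and it assumes simple .dvc files like the ones shown in the question, i.e. files produced by a plain dvc add, with one path: key per output and no deps: section. It works on the file text directly, so it needs nothing beyond the standard library.

```python
def add_remote_field(dvc_text: str, remote: str) -> str:
    """Return the .dvc file text with `remote: <remote>` added after each `path:` key.

    Hypothetical helper: assumes a plain `dvc add` file whose only `path:`
    keys belong to entries under `outs:`.
    """
    lines = dvc_text.splitlines()

    # Leave files that already name a remote untouched, so re-running is harmless.
    if any(line.strip().startswith("remote:") for line in lines):
        return dvc_text

    result = []
    for line in lines:
        result.append(line)
        body = line.lstrip()
        indent = len(line) - len(body)
        # A list item such as "- path: foo" carries its first key on the dash line.
        if body.startswith("- "):
            body = body[2:]
            indent += 2
        if body.startswith("path:"):
            # Add the new key at the same indentation as the other keys of this entry.
            result.append(" " * indent + f"remote: {remote}")
    return "\n".join(result) + "\n"


# The test-data file from the question, used here as sample input.
SAMPLE = """\
outs:
- md5: c06520abe0140c72004dbe4494a78b23.dir
  size: 692847854
  nfiles: 8
  hash: md5
  path: some_folder
"""

print(add_remote_field(SAMPLE, "test-data"), end="")
```

To apply it to a real file, read the file's text, pass it through the function, and write the result back; then commit the updated .dvc files. After that, a plain dvc push / dvc pull should route each output to the remote named in its .dvc file, as described in the answer.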