git, machine-learning, dvc

How to add/update data with dvc workflow?


When setting up DVC, can I simply add my entire directory as is (dvc add dataset), so that my workflow is to update the whole dataset folder for each iteration? The contents of this folder should be cached, and if I ever want to go back to a previous version of the data, I should be able to run dvc checkout. Or is it better to add each file to DVC individually?

— .dvc
  - config
— dataset
  - fileone.cvs
- train.py
- requirements.txt

I have tracked individual files so far, but wouldn't it be easier to track the entire folder if I have hundreds of files?


Solution

  • Yes, the whole directory can be added at once and this is the recommended way to handle directories in DVC. Having 100s of .dvc files is discouraged and not what DVC is optimized for.

    Here is an example in the documentation. Pretty much, you can do:

    dvc add dataset
    

    No matter how many files are inside the dataset directory, DVC will create a single dataset.dvc file that will handle the whole directory. Files will be cached (one time per unique file per dataset).

    To update it later, you can run dvc add or dvc commit again. To get back to a previous version, you can use the same mechanics as described here.

    Here is a brief summary of some technical details that I recommend reading if you'd like to understand the implications better.

    If there are a lot of files inside the directory, please also read Large Dataset Optimization.