Updating tracked dir in DVC


According to this tutorial, when I update a file I should first remove it from DVC control (i.e. execute dvc unprotect <myfile>.dvc or dvc remove <myfile>.dvc) and then add it again via dvc add <myfile>. However, it's not clear whether I should apply the same workflow to directories.

I have a directory under DVC control with the following structure:

data/
    1.jpg
    2.jpg

Should I run dvc unprotect data every time the directory content is updated?

More specifically, I'd like to know whether I should run dvc unprotect data in the following use cases:

  • A new file is added. For example, I put a 3.jpg image in the data dir.
  • A file is deleted. For example, I delete the 2.jpg image from the data dir.
  • A file is updated. For example, I edit the 1.jpg image in a graphics editor.
  • A combination of the previous use cases (i.e. some files are updated, others are deleted, and new files are added).

Solution

  • Only when a file is updated (i.e. you edit 1.jpg with your editor), AND only if the hardlink or symlink cache type is enabled. Adding or deleting files does not require dvc unprotect.

    Please check this link:

    updating tracked files has to be carried out with caution to avoid data corruption when the DVC config option cache.type is set to hardlink or/and symlink

    I would strongly recommend reading this document: Performance Optimization for Large Files. It explains the benefits of using hardlinks/symlinks.
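To see why only in-place edits are risky with hardlinks, here is a minimal sketch in plain Python. It does not use DVC at all and is not how DVC is implemented; the file names (`cache_entry`, `1.jpg`) and the `unprotect` helper are hypothetical stand-ins. It assumes a filesystem that supports hard links. A workspace file that is hard-linked to a cache entry shares its bytes with it, so editing the file in place silently changes the cached copy as well, while replacing the link with an independent copy first keeps the cached copy intact.

```python
import os
import shutil
import tempfile


def unprotect(path):
    """Replace a hard-linked file with an independent copy of its content.

    Hypothetical helper: it only mimics the effect of breaking the link.
    """
    tmp = path + ".tmp"
    shutil.copyfile(path, tmp)  # new inode holding the same bytes
    os.replace(tmp, path)       # workspace now points at the copy, not the cache


def demo():
    root = tempfile.mkdtemp()
    cached = os.path.join(root, "cache_entry")  # hypothetical cache file
    tracked = os.path.join(root, "1.jpg")       # workspace file

    with open(cached, "wb") as f:
        f.write(b"original")
    os.link(cached, tracked)  # workspace file shares its inode with the cache

    # Unsafe: an in-place edit through the link also rewrites the cached bytes.
    with open(tracked, "r+b") as f:
        f.write(b"EDITED!!")
    with open(cached, "rb") as f:
        after_in_place_edit = f.read()

    # Reset the shared content, then break the link before editing.
    with open(cached, "wb") as f:
        f.write(b"original")
    unprotect(tracked)
    with open(tracked, "r+b") as f:
        f.write(b"EDITED!!")
    with open(cached, "rb") as f:
        after_unprotected_edit = f.read()

    shutil.rmtree(root)
    return after_in_place_edit, after_unprotected_edit


if __name__ == "__main__":
    print(demo())
```

Adding 3.jpg or deleting 2.jpg never writes through an existing link, which is why those cases are safe without this step. In DVC terms, the safe path for an edit corresponds to running dvc unprotect data/1.jpg before editing and dvc add data afterwards.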