git machine-learning continuous-integration dvc mlops

Is it necessary to commit DVC files from our CI pipelines?


DVC uses Git commits to save experiments and to navigate between them.

Is it possible to avoid making auto-commits in CI/CD (that is, commits made only to save data artifacts after running dvc repro on the CI/CD side)?


Solution

  • will you make it part of CI pipeline

    DVC often serves as part of an MLOps infrastructure. There is a popular blog post about CI/CD for ML where DVC is used under the hood. Another example does the same with GitLab CI/CD.

    scenario where you will integrate dvc commit command with CI pipelines?

    If you mean a git commit of DVC files (not dvc commit), then yes, you need to commit dvc-files into Git during the CI/CD process. Auto-commit, however, is not the best practice.

    How to avoid Git commits in CI/CD:

    1. After ML model training in CI/CD, save the changed dvc-files to external storage (for example, GitLab artifacts/releases), then fetch the files to a developer machine and commit them there. Users usually write scripts to automate this.
    2. Wait for the DVC 1.0 release, when run-cache (similar to a build cache) will be implemented. Run-cache makes dvc-files ephemeral, so no additional Git commits will be required. Technically, run-cache is an associative store that maps repo state --> run results and is kept outside the Git repo (in the data remote).
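
The kind of script mentioned in option 1 could look roughly like the sketch below. It is only an illustration, not something shipped with DVC: the function names, the `dvc-artifacts` directory, and the set of file names treated as dvc-files are assumptions to adapt to your repository. It reads `git status --porcelain` after `dvc repro` has run and copies the changed dvc-files into a directory that the CI job can then publish as an artifact.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Assumption: besides "*.dvc", these names count as dvc-files in your repo.
DVC_FILE_NAMES = {"Dvcfile", "dvc.lock"}


def changed_dvc_files(porcelain_output: str) -> List[str]:
    """Pick changed dvc-files out of `git status --porcelain` (v1) output.

    Limitation: paths that Git quotes (spaces, special characters) are not handled.
    """
    paths = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue
        status, path = line[:2], line[3:]
        if "D" in status:
            continue  # a deleted file cannot be copied
        if " -> " in path:
            path = path.split(" -> ", 1)[1]  # rename entry: keep the new name
        name = Path(path).name
        if name.endswith(".dvc") or name in DVC_FILE_NAMES:
            paths.append(path)
    return paths


def collect(paths: List[str], repo_root: str, artifact_dir: str) -> None:
    """Copy the given repo-relative files into artifact_dir, keeping their layout."""
    for rel in paths:
        dest = Path(artifact_dir) / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(Path(repo_root) / rel, dest)


def main(artifact_dir: str = "dvc-artifacts") -> None:
    """Run from the repository root, after `dvc repro` has finished."""
    out = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    collect(changed_dvc_files(out), ".", artifact_dir)
```

Calling `main()` as the last step of the CI job leaves the changed dvc-files in `dvc-artifacts/`. A developer can download that directory, copy the files into a local checkout, and make the Git commit there by hand.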

    Disclaimer: I'm one of the creators of DVC.