mercurial, development-environment, dvcs

Pros and cons for keeping code and data in separate repositories


We have a project which has data and code, bundled into a single Mercurial repository. The data is just as important as the code (it contains parameters for business logic, some inputs, etc.). However, the format of the data files rarely changes, and it's quite natural to change the data files independently of the code.

One advantage of the unified repository is that we don't have to keep track of multiple revisions: if we ever need to recreate output from a previous run, we only need to update the system to the single revision number stored in the output log.
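A minimal sketch of that workflow, assuming the run script writes a `revision: <id>` line into its output log (the line format and all function names here are hypothetical). It relies only on `hg id -i`, which prints the working-directory changeset ID with a trailing `+` when there are uncommitted changes, and on `hg update -r REV`. The `run` parameter is there only so the logic can be exercised without a real repository.

```python
import subprocess


def hg(args, cwd="."):
    """Run an hg command in cwd and return its stripped stdout."""
    result = subprocess.run(
        ["hg", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def current_revision(run=hg):
    """Return the working-directory changeset ID.

    Refuses to continue if `hg id -i` reports uncommitted changes
    (trailing '+'), since such a run could not be recreated later.
    """
    rev = run(["id", "-i"])
    if rev.endswith("+"):
        raise RuntimeError("uncommitted changes: output would not be reproducible")
    return rev


def log_header(rev):
    """Format the (hypothetical) log line that records the revision."""
    return f"revision: {rev}"


def revision_from_log(text):
    """Extract the recorded revision from the text of an output log."""
    prefix = "revision: "
    for line in text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    raise ValueError("no revision line in log")


def restore(log_text, run=hg):
    """Update the working directory to the revision recorded in a log."""
    rev = revision_from_log(log_text)
    run(["update", "-r", rev])
    return rev
```

Because code and data share one repository, that single ID pins both of them; with separate repositories the log would need one such line per repository.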

One disadvantage is that if we modify the data while multiple heads are active, the change lands on only one head; the other heads miss it unless we manually copy (or merge) the change to each of them.

Are there any other pros/cons to splitting the code and the data into separate repositories?


Solution

  • Multiple repos:

    • pros:

      • component-based approach (you identify groups of files that can evolve independently of one another)
      • configuration specification: you list the references (here "revisions") you need for your system to work. If you want to modify one part without changing the other, you update that list.
      • partial clones: if you don't need all components, you can clone only the ones you want (doesn't apply in your case)
    • cons:

      • configuration management: you need to track that configuration (usually through a parent repo, registering subrepos)
      • in your case, the data is quite dependent on particular versions of the project (you can have new data which doesn't make sense for old versions of the project)

  • One repo:

    • pros:
      • system-based approach: you see your modules as one system (project and data).
      • repo management: all in one
      • tight link between modules (which can make sense for data)
    • cons:
      • data propagation (when, as you mention, several heads are active)
      • intermediate revisions (not to reflect a new feature, but just because some data changes)
      • larger clone (not relevant here, unless your data includes large binaries)

For non-binary data that changes infrequently, I would still keep it in the same repo as the code.
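If you do split them, the "configuration specification" from the multiple-repos list could look roughly like the sketch below: a small manifest that maps each component to the revision a run used. The `name=revision` line format, the function names, and the example revision IDs are all hypothetical. Mercurial's own subrepository feature records the same kind of information in a parent repo (`.hgsub` lists the subrepos, `.hgsubstate` pins their revisions).

```python
def format_manifest(revisions):
    """Render {component: revision} as sorted 'component=revision' lines."""
    return "\n".join(f"{name}={rev}" for name, rev in sorted(revisions.items()))


def parse_manifest(text):
    """Parse 'component=revision' lines back into a dict, skipping other lines."""
    revisions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, rev = line.split("=", 1)
        revisions[name.strip()] = rev.strip()
    return revisions


def changed_components(old, new):
    """List the components whose pinned revision differs between two manifests."""
    return sorted(
        name for name in old.keys() | new.keys() if old.get(name) != new.get(name)
    )


# Hypothetical revision IDs for two runs: only the data repo moved.
run1 = {"code": "a1b2c3d4e5f6", "data": "0f9e8d7c6b5a"}
run2 = {"code": "a1b2c3d4e5f6", "data": "1234567890ab"}
log_text = format_manifest(run1)
```

Recreating an old run then means updating every repository to the revision listed for it, which is the extra bookkeeping the single-repo setup avoids.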