Tags: version-control, dvcs

DVCS and data loss?


After almost two years of using DVCS, it seems that one inherent "flaw" is accidental data loss: I have lost code which isn't pushed, and I know other people who have as well.

I can see a few reasons for this: off-site data duplication (i.e., "commits have to go to a remote host") is not built in, the repository lives in the same directory as the code, and the notion of "hack 'till you've got something to release" is prevalent... But that's beside the point.

I'm curious to know: have you experienced DVCS-related data loss? Or have you been using DVCS without trouble? And, related, apart from "remember to push often", is there anything which can be done to minimize the risk?


Solution

  • I've lost more data from clobbering uncommitted changes in a centralized VCS, and then deciding that I actually wanted them, than from anything I've done with a DVCS. Part of that is that I've been using CVS for almost a decade and git for under a year, so I've had a lot more opportunities to get into trouble with the centralized model, but differences in the properties of the workflow between the two models are also major contributing factors.

    Interestingly, most of the reasons for this boil down to "BECAUSE it's easier to discard data, I'm more likely to keep it until I'm sure I don't want it". (The only difference between discarding data and losing it is that you meant to discard it.) The biggest contributing factor is probably a quirk of my workflow habits - my "working copy" when I'm using a DVCS is often several different copies spread out over multiple computers, so corruption or loss in a single repo or even catastrophic data loss on the computer I've been working on is less likely to destroy the only copy of the data. (Being able to do this is a big win of the distributed model over centralized ones - when every commit becomes a permanent part of the repository, the psychological barrier to copying tentative changes around is a lot higher.)

    As far as minimizing the risks, it's possible to develop habits that minimize them, but you have to develop those habits. Two general principles there:

    • Data doesn't exist until there are multiple copies of it in different places. There are workflow habits that will give you multiple copies for free - for example, if you work in two different places, you'll have a reason to push to a common location at the end of every work session, even if it's not ready to publish.
    • Don't try to do anything clever, stupid, or beyond your comfort zone with the only reference to a commit you might want to keep. Create a temporary tag that you can revert to, or create a temporary branch to do the operations on. (git's reflog lets you recover old references after the fact; I'd be unsurprised if other DVCSs have similar functionality. So manual tagging may not be necessary, but it's often more convenient anyway.)
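    Both habits can be sketched in a short git session. This is a minimal illustration, not a prescribed workflow; the repo layout, remote name (`backup`), and tag name (`wip-backup`) are all made up for the demo:

    ```shell
    # Sketch of both habits using a throwaway repo; all paths and
    # names (demo, backup, wip-backup) are illustrative.
    set -e
    tmp=$(mktemp -d); cd "$tmp"
    git init -q demo && cd demo
    git config user.email dev@example.com
    git config user.name Dev

    echo "stable" > file.txt
    git add file.txt && git commit -qm "base"
    echo "tentative work" >> file.txt
    git commit -aqm "WIP: tentative changes"

    # Habit 1: push unfinished work to a second location so more
    # than one copy of it exists, even if it's not ready to publish.
    git init -q --bare "$tmp/backup.git"
    git remote add backup "$tmp/backup.git"
    git push -q backup HEAD

    # Habit 2: create a temporary tag before doing anything risky
    # with the only reference to a commit you might want to keep.
    git tag wip-backup

    # Simulate a mistake that throws the commit away...
    git reset -q --hard HEAD~1

    # ...and recover it: the tag (or "git reflog") still points at it.
    git merge -q --ff-only wip-backup
    grep "tentative work" file.txt
    ```

    Even without the tag, `git reflog` would show the discarded commit's hash for a while, but the tag makes the recovery a one-liner with no archaeology.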