Search code examples
gitdiffpatchlevenshtein-distancelcs

How do diff/patch work and how safe are they?


Regarding how they work, I was wondering low-level working stuff:

  1. What will trigger a merge conflict?
  2. Is the context also used by the tools in order to apply the patch?
  3. How do they deal with changes that do not actually modify source code behavior? For example, swapping function definition places.

Regarding safety, truth be told, the huge Linux kernel repository is a testament for their safety. But I wondering about the following points:

  1. Are there any caveats/limitations regarding the tools that the user should be aware of?
  2. Have the algorithms been proven to not generate wrong results?
  3. If not, are there implementations/papers proposing integration testing that at least prove them to be error-free empirically? Something like the content of these papers BrianKorver and JamesCoplien.
  4. Again, the Linux repository should suffice regarding the previous point, but I was wondering about something more generic. Source code, even when changed, will not change much (specially because of the algorithm implemented and syntax restrictions), but can the safety be generalized to generic text files?

Edit

Ok people, I'm editing since the question is vague and answers are not addressing details.

Git/diff/patch details

The unified diff format, which Git seems to use by default, basically outputs three things: the change, the context surrounding the change, and line numbers pertinent to the context. Each one of these things may or may not have been changed concurrently, so Git basically has to deal with 8 possible cases.

For example, if lines have been added or removed before the context, line numbers will be different; but if the context and the changes are still the same, then diff could use the context itself to align the texts and apply the patch (I do not know if this indeed happens). Now, what would happen on the other cases? I would like to know details of how Git decides to apply changes automatically and when it decides to issue an error and let the user resolve the conflict.

Reliability

I'm pretty much sure the Git is fully reliable since it do have the full history of commits and can traverse history. What I would like is some pointers to academic research and references regarding this, if they exist.

Still kinda related to this subject, we know that Git/diff treat files as generic text files and work on lines. Furthermore, the LCS algorithm employed by diff will generate a patch trying to minimize the number of changes.

So here are some things I would like to know also:

  1. Why is LCS used instead of other string metric algorithms?
  2. If LCS is used, why not use modified versions of the metric that do take into account the grammatical aspects of the underlying language?
  3. If such a metric that takes into account grammatical aspects are used, could they provide benefits? Benefits in this case could be anything, for example, a cleaner "blame log".

Again, these could be huge topics and academic articles are welcome.


Solution

  • What will trigger a merge conflict?

    Let's look at the simplest of git's merge strategies, recursive, first: When merging two branches, say a and b, that have a common ancestor c, git creates a patch to go from commit c to the commit ad the head of a and tries to apply that patch to the tree at the head of b. If the patch fails, that's a merge conflict.

    git by default uses the recursive strategy, a 3-way merge. The general idea is the same: If the 3-way merge algorithm described in the link fails because two commits from different branches changed the same lines, that's a merge conflict.

    Is the context also used by the tools in order to apply the patch?

    Yes. If a patch does not apply at the exact line number stored in the diff file, patch tries to find the right line a couple of lines adjacent to the original one based on the context.

    How do they deal with changes that do not actually modify source code behavior? For example, swapping function definition places.

    patch is not intelligent, it can not differentiate between such changes. It regards a moved function as a couple of added and a couple of deleted lines. If a commit on one branch alters a function and a commit on another moves the unaltered, then an attempt to merge will always give you a merge conflict.

    Are there any caveats/limitations regarding the tools that the user should be aware of?

    As for patch and diff: No. Both use algorithms that have been around since the early 1970s and are quite robust. As long as they don't complain, you can be fairly certain that they did what you intended.

    That being said: git merge tries to resolve merge conflicts on its own. In some rare cases, things can go wrong here - this page has an example close to its end.

    Have the algorithms been proven to not generate wrong results? If not, are there implementations/papers proposing integration testing that at least prove them to be error-free empirically?

    "wrong results" is a fairly unspecific term in this context; I'd claim it cannot be proven. What is empirically proven is that applying a patch generated by diff a b to file a will in any case produce file b.

    Source code, even when changed, will not change much (specially because of the algorithm implemented and syntax restrictions), but can the safety be generalized to generic text files?

    Again, diff/patch/git does not differentiate between source code and other text files. git works as well on generic text files as it does on source code.

    I'm pretty much sure the Git is fully reliable since it do have the full history of commits and can traverse history. What I would like is some pointers to academic research and references regarding this, if they exist.

    Commits in git are snapshots of the tree with meta data, not diffs to the adjacent versions. Patch and diff are not involved in revision traversal at all. (But one level below the surface, git then organizes blobs in pack files that do use a delta compression algorithm. Errors here would be easy to spot because git internally uses sha1 sums to identify files, and the sum would change if an error occurred.)

    Why is LCS used instead of other string metric algorithms?

    git uses Myers' algorithm by default. The original paper explains why it works the way it does. (It's not purely LCS.)