Search code examples
algorithmdiffdelta

Algorithm to Combine Text Deltas into Single 'Superstring'


I'm writing software to track the changes an author made over several editions of a book. I already wrote code that produces a set of deltas describing differences between two editions.

Now I'm looking for an algorithm to combine all of these diffs inline to create a 'superstring' containing all of the text inserted and removed across every edition. Then I want to mark up the string in HTML with information about where the text was added and removed.

This way I can visualize the differences between the texts by simply applying different CSS attributes to the document.

Example

If the author changes a sentence in this way

-0-    --1--    ---2---    ---3---
' ' -> 'cat' -> 'crate' -> 'crane'

My code produces these deltas

0-1) <insert 'cat' at 0>
1-2) <insert 'r' at 1> <insert 'e' at 3>
2-3) <remove from 3 to 4> <insert 'n' at 3>

Which I want to process to create a file like this:

<span class="inserted-1">c</span>
<span class="inserted-2">r</span>
<span class="inserted-1">a</span>
<span class="inserted-1 removed-3">t</span>
<span class="inserted-3">n</span>
<span class="inserted-2">e</span>

Question

What's the best algorithm to accomplish this task? Is there a name for this problem?


Solution

  • You can just concatenate your changes and keep track of when they were inserted/removed. Note that the numbers give the index within the string (and note that chars removed do not increase index).

    Step 1: 0-1) <insert 'cat' at 0>

    • [0] c inserted at step 1
    • [1] a inserted at step 1
    • [2] t inserted at step 1

    Step 2: 1-2) <insert 'r' at 1> <insert 'e' at 3>

    • [0] c inserted at step 1
    • [1] r inserted at step 2 <= this was inserted here in this step at position 1
    • [2] a inserted at step 1
    • [3] t inserted at step 1
    • [4] e inserted at step 2 <= this was inserted here in this step at position 3

    Note that the position of 'e' was actually shifted to 4 due to the other insertion.

    Step 3: 2-3) <remove from 3> <insert 'n' at 3> <= I changed this one to the minimal diff

    • [0] c inserted at step 1
    • [1] r inserted at step 2
    • [2] a inserted at step 1
    • [3] t inserted at step 1, removed at step 3 <= does no longer count, thus the next index is the same
    • [3] n inserted at step 3 <= this was inserted here in this step at position 3
    • [4] e inserted at step 2

    So the basic algorithm is:

    • maintain a list of characters, together with step of insertion and step of removal
    • for each step do
      • decompose your diff from this step into single char inserts and removals
      • for a insert of a new char X at position P do the following:
        • insert the new char X after the latest char in your list with index P, set step of insertion to current step and adjust indices of following items (i.e. add one).
      • for a deletion at a position P do
        • mark the char in your list with index P (there is only one of them which still is present, i.e. where the removal step is not set) by setting the removal step to the current step and adjust later indices (i.e. subtract one)

    In both cases note that previous insertions/removals within this step may shift the position of current action (one way to easily work around this is to do insertions/removals backwards from end of string to start).

    The result will be a list of changes as you specified in your question. For lots of changes it might become quite unreadable but it will nevertheless describe the complete history of your text.