I'm writing software to track the changes an author made over several editions of a book. I already wrote code that produces a set of deltas describing differences between two editions.
Now I'm looking for an algorithm to combine all of these diffs inline to create a 'superstring' containing all of the text inserted and removed across every edition. Then I want to mark up the string in HTML with information about where the text was added and removed.
This way I can visualize the differences between the texts by simply applying different CSS attributes to the document.
If the author changes a sentence in this way
-0- --1-- ---2--- ---3---
' ' -> 'cat' -> 'crate' -> 'crane'
My code produces these deltas
0-1) <insert 'cat' at 0>
1-2) <insert 'r' at 1> <insert 'e' at 3>
2-3) <remove from 3 to 4> <insert 'n' at 3>
Which I want to process to create a file like this:
<span class="inserted-1">c</span>
<span class="inserted-2">r</span>
<span class="inserted-1">a</span>
<span class="inserted-1 removed-3">t</span>
<span class="inserted-3">n</span>
<span class="inserted-2">e</span>
What's the best algorithm to accomplish this task? Is there a name for this problem?
You can just concatenate your changes and keep track of when they were inserted/removed. Note that the numbers give the index within the string (and note that chars removed do not increase index).
Step 1: 0-1) <insert 'cat' at 0>
[0] c inserted at step 1
[1] a inserted at step 1
[2] t inserted at step 1
Step 2: 1-2) <insert 'r' at 1> <insert 'e' at 3>
[0] c inserted at step 1
[1] r inserted at step 2
<= this was inserted here in this step at position 1[2] a inserted at step 1
[3] t inserted at step 1
[4] e inserted at step 2
<= this was inserted here in this step at position 3Note that the position of 'e' was actually shifted to 4 due to the other insertion.
Step 3: 2-3) <remove from 3> <insert 'n' at 3>
<= I changed this one to the minimal diff
[0] c inserted at step 1
[1] r inserted at step 2
[2] a inserted at step 1
[3] t inserted at step 1, removed at step 3
<= does no longer count, thus the next index is the same[3] n inserted at step 3
<= this was inserted here in this step at position 3[4] e inserted at step 2
So the basic algorithm is:
In both cases note that previous insertions/removals within this step may shift the position of current action (one way to easily work around this is to do insertions/removals backwards from end of string to start).
The result will be a list of changes as you specified in your question. For lots of changes it might become quite unreadable but it will nevertheless describe the complete history of your text.