Search code examples
pythonalgorithmversion-controlwiki

Tracking changes in lines or paragraphs between text file revisions


I'm investigating building an inline commenting system over a wiki (but this is not the question). The first thing that comes to mind is being able to associate a comment to a line, paragraph or string with a unique ID. Then to have this association remain valid over multiple edits of the wiki.

For example, from this input:

R1:

1 One
2 Two
3 
4 Three
5 Four

R2:

1 One
2 Three
3 Two
4
5 Four

I'd like to have a mapping between lines 1⇒1, 2⇒3, 3⇒4, 5⇒5 and know a new line 2 was added and 4 was removed. Dealing with line 4 moving to line 2 is probably starting to make some wild guesses, I'm not sure.

There are lots of tools that do similar text operations (no, I'm not after a tool). Diff, version control and revision merging come to mind. Are there well known algorithms to track content between file changes? My preference is for python but am interested to know about others too.


Solution

  • Since you have heard of diff, you are probably interested in the Hunt-McIlroy algorithm on which (and other tools of its type) it is based. By the way, diff is available in Python as difflib.

    Actual revision tracking (like Word does when "Track Changes" is on) requires tagging the text in some way and can't really be done with plain text. But diff and merge algorithms are pretty damn good at figuring this out.