I am using the diff-match-patch library in JavaScript to compute and visualize the differences between two configuration files for some components. It usually works well. However, I've noticed a problem when, for example, one line of a file contains the number "1657" and the corresponding line in the other file contains the number "2160". In this case, I want it to strike through the whole number "1657" and show "2160" as a completely new addition. Instead, it only strikes through the "57" and shows the "2" and "0" in "2160" as new additions. This is how it looks in the diff-match-patch demo:
I understand this behavior is due to how the diff-match-patch algorithm works: it only sees "2" and "0" as new additions because they didn't exist in the previous string. However, I am not sure how to fix it. I initially believed diff-match-patch doesn't do a character-by-character comparison, but the demo page states that it actually does.
This is the current state of the function where I use diff-match-patch:
function generateDiff(text1, text2) {
    let dmp = new diff_match_patch();
    let diff = dmp.diff_main(text1, text2);
    dmp.diff_cleanupSemantic(diff);
    let display1 = '';
    let display2 = '';
    for (let i = 0; i < diff.length; i++) {
        let part = diff[i];
        let op = part[0]; // Operation (insert, delete, equal)
        let data = part[1]; // Text of the change
        let span1 = document.createElement('span');
        let span2 = document.createElement('span');
        if (op === 0) { // Equal
            span1.className = 'diff-equal';
            span2.className = 'diff-equal';
            span1.textContent = data;
            span2.textContent = data;
            display1 += span1.outerHTML;
            display2 += span2.outerHTML;
        } else if (op === -1) { // Delete
            span1.className = 'diff-delete';
            span1.textContent = data;
            display1 += span1.outerHTML;
        } else { // Insert
            span2.className = 'diff-insert';
            span2.textContent = data;
            display2 += span2.outerHTML;
        }
    }
    return [display1, display2];
}
I have tried to handle only numbers differently by using some regex and identifying them in the text, but it didn't work. I would appreciate any help on this, thank you!
By default, diff-match-patch creates diffs on the basis of characters (strictly speaking, on the basis of UTF-16 code units). The Line or Word Diffs article in the docs explains how to implement other types of diffs: you need to convert each unique line/word into a unique char, diff the char strings, then convert back.
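The convert, diff, convert-back idea can be sketched roughly as follows. This is a minimal illustration rather than the library's own implementation: `tokensToChars`, `charsToTokens`, `diffByTokens` and the `tokenize` parameter are hypothetical names, and `dmp` is assumed to be a `diff_match_patch` instance (only its `diff_main` method is used).

```javascript
// Sketch: token-level diff on top of a char-based differ.

// Map every distinct token to a single UTF-16 code unit.
function tokensToChars(tokenLists) {
  const tokenArray = ['']; // index 0 is a placeholder so no token maps to '\u0000'
  const indexByToken = new Map();
  const encoded = tokenLists.map((tokens) =>
    tokens.map((token) => {
      if (!indexByToken.has(token)) {
        indexByToken.set(token, tokenArray.length);
        tokenArray.push(token);
      }
      return String.fromCharCode(indexByToken.get(token));
    }).join('')
  );
  // One code unit per token limits this sketch to 65,535 distinct tokens.
  if (tokenArray.length > 0x10000) {
    throw new RangeError('too many distinct tokens for this sketch');
  }
  return { encoded, tokenArray };
}

// Replace each placeholder char in the diff output with its original token.
function charsToTokens(diffs, tokenArray) {
  return diffs.map((diff) => [
    diff[0],
    diff[1].split('').map((ch) => tokenArray[ch.charCodeAt(0)]).join(''),
  ]);
}

// `tokenize` is a caller-supplied function: string -> array of substrings
// whose concatenation equals the input.
function diffByTokens(dmp, text1, text2, tokenize) {
  const { encoded, tokenArray } = tokensToChars([tokenize(text1), tokenize(text2)]);
  // Character-level cleanup is deliberately skipped here, on the assumption
  // that running it on the decoded text could split tokens apart again.
  const diffs = dmp.diff_main(encoded[0], encoded[1], false);
  return charsToTokens(diffs, tokenArray);
}

// Usage (assumes the library's `diff_match_patch` constructor is loaded):
// const splitOnWhitespace = (s) => s.match(/\s+|\S+/g) ?? [];
// diffByTokens(new diff_match_patch(), '1657\nTime: 0s', '2160\nTime: 0s', splitOnWhitespace);
```

With a tokenizer like the one in the usage comment, each number is a single token, so it can only be reported as deleted or inserted as a whole.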
I created a fork of diff-match-patch called diff-match-patch-unicode, which essentially wraps the library in some convenience methods, defaulting to using Unicode code points rather than UTF-16 code units, and also providing an easy way of customizing the diff granularity through the segmenter option:
import { Differ, segmenters } from 'jsr:@clearlylocal/diff-match-patch-unicode'
const differ = new Differ()
const str1 = '1657\nTime: 0s'
const str2 = '2160\nTime: 0s'
// default options (diff by char, similar to the current behavior you're seeing)
differ.diff(str1, str2)
// [
// Diff #[ 1, "2" ],
// Diff #[ 0, "16" ],
// Diff #[ -1, "57" ],
// Diff #[ 1, "0" ],
// Diff #[ 0, "\nTime: 0s" ]
// ]
// passing word segmenter as an option (diff by word)
differ.diff(str1, str2, { segmenter: segmenters.word })
// [
// Diff #[ -1, "1657" ],
// Diff #[ 1, "2160" ],
// Diff #[ 0, "\nTime: 0s" ]
// ]
If you'd rather use the original library, feel free to crib from the relevant parts of my Differ#diff method and SegmentCodec class, which provide the necessary abstraction. Alternatively, you can just roll your own based on the Line or Word Diffs docs linked above.
If you do decide to roll your own, and assuming words are the unit you want to use, I highly recommend using an Intl.Segmenter instance such as new Intl.Segmenter('en-US', { granularity: 'word' }) to do the conversion.
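As a rough sketch of that segmentation step (the helper name `segmentWords` is hypothetical and the locale is just the one mentioned above), the segments can be collected into a plain array of strings, ready to be mapped to chars:

```javascript
// Split text into word-level segments. Non-word segments (spaces, newlines,
// punctuation) are kept, so the pieces concatenate back to the original text.
function segmentWords(text, locale = 'en-US') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const tokens = segmentWords('1657\nTime: 0s');
console.log(tokens.includes('1657')); // true
console.log(tokens.join('') === '1657\nTime: 0s'); // true
```

Keeping the whitespace and punctuation segments matters here, because the decoded diff has to reproduce the original text exactly.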
Another alternative would be to still do a mainly char-based diff but treat numbers as a special case, i.e. the diff unit is something like "one-or-more digits OR any other single char". Using my fork, that'd look something like this:
differ.diff(str1, str2, { segmenter: (str) => str.match(/\d+|./gsu) ?? [] })
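To make the "one-or-more digits OR any other single char" unit concrete, here is what that regex produces on its own, independent of any diff library (the name `segmentDigitsOrChar` is only for illustration):

```javascript
// Runs of digits become one segment; every other character is its own segment.
// The `s` flag lets `.` match newlines; the `u` flag makes `.` match whole code points.
const segmentDigitsOrChar = (str) => str.match(/\d+|./gsu) ?? [];

console.log(segmentDigitsOrChar('1657\nTime: 0s'));
// [ '1657', '\n', 'T', 'i', 'm', 'e', ':', ' ', '0', 's' ]
```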