I'm using Python's difflib to calculate the diff between two plaintext English paragraphs.
The paragraphs are very similar: one has an extra leading sentence and extra trailing sentences. There are also minor character-level differences (mostly whitespace).
Unfortunately, I'm getting very bad results. It seems like a character in the beginning of the diff is throwing it off and it sprinkles random characters throughout.
Websites like diffchecker.com have no problem calculating the diff. I also notice that if I decrease the window of the difflib to ignore the first sentence, it computes the diff correctly. Has anyone else noticed this issue?
Attaching my code and the sample passages below. Thanks so much.
import difflib
s1 = "Ableton Live also supports Audio To MIDI, which converts audio samples into a sequence of MIDI notes using three different conversion methods including conversion to Melody, Harmony, or Rhythm. Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes.[14] See Fourier transform.Envelopes[edit]Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them.User interface[edit]Much of Live’s interface comes from being designed for use in live performance, as well as for production.[15] There are few pop up messages or dialogs. Portions of the interface are hidden and shown based on arrows which may be clicked to show or hide a certain segment (e.g. to hide the instrument/effect list or to show or hide the help box)."
s2 = "Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes. [14] See Fourier transform . Envelopes[ edit ] Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them. User interface[ edit ] Much of Live’s interface comes from being designed for use in live performance, as well as for production."
if __name__ == "__main__":
    res = [d for d in difflib.ndiff(s1, s2)]
    print(res)
As the docs say,
Compare a and b (lists of strings) ... return a Differ-style delta (a generator generating the delta lines).
ndiff() is intended to, e.g., compare two files, given lists of the lines the files contain, much like the common Unixy diff utility.
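A small sketch of what happens when plain strings are passed instead of lists of lines: each string is iterated character by character, so every delta entry is a two-character prefix plus a single character. The sample strings are made up for illustration.

```python
import difflib

# Made-up sample strings; passing str objects (not lists of lines) is the point.
delta = list(difflib.ndiff("abc", "abd"))

for entry in delta:
    # Each entry is "<tag> <item>", and here every item is one character.
    print(repr(entry))
```

With two long paragraphs this produces one entry per character, which is why the result looks like characters sprinkled at random.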
You're trying to compare two individual lines, so ndiff() ends up treating every character as a "line" of its own. difflib doesn't have a built-in "pretty-printed" way to compare two single strings, but it does supply comparison facilities on which you can build whatever formatting you like. For example,
d = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
for op in d.get_opcodes():
    print(op)
prints
('delete', 0, 194, 0, 0)
('equal', 194, 446, 0, 252)
('insert', 446, 446, 252, 253)
('equal', 446, 472, 253, 279)
('insert', 472, 472, 279, 280)
('equal', 472, 473, 280, 281)
('insert', 473, 473, 281, 282)
('equal', 473, 483, 282, 292)
('insert', 483, 483, 292, 293)
('equal', 483, 487, 293, 297)
('insert', 487, 487, 297, 298)
('equal', 487, 488, 298, 299)
('insert', 488, 488, 299, 300)
('equal', 488, 1143, 300, 955)
('insert', 1143, 1143, 955, 956)
('equal', 1143, 1158, 956, 971)
('insert', 1158, 1158, 971, 972)
('equal', 1158, 1162, 972, 976)
('insert', 1162, 1162, 976, 977)
('equal', 1162, 1163, 977, 978)
('insert', 1163, 1163, 978, 979)
('equal', 1163, 1269, 979, 1085)
('delete', 1269, 1508, 1085, 1085)
See the docs for the precise meanings of those. They succinctly describe what's needed to change s1 into s2. The long exactly matching block is described by the ('equal', 488, 1143, 300, 955), and, indeed,
>>> s1[488 : 1143] == s2[300 : 955]
True
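One way to convince yourself that the opcodes really describe how to turn the first string into the second is to replay them. This is only a sketch: `apply_opcodes` is a hypothetical helper name, not part of difflib, and the sample strings are made up.

```python
import difflib


def apply_opcodes(a, b):
    """Rebuild b by walking the opcodes that transform a into b."""
    sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            # Unchanged text can be copied straight from a.
            pieces.append(a[i1:i2])
        elif tag in ("insert", "replace"):
            # New text always comes from b.
            pieces.append(b[j1:j2])
        # "delete" contributes nothing to the result.
    return "".join(pieces)


rebuilt = apply_opcodes("qabxcd", "abycdf")
print(rebuilt == "abycdf")
```

In each tuple the first index pair slices the first string and the second pair slices the second string.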
Suggestion: break your two inputs into sentences, and view each input as a sequence (like a list) of newline-terminated sentences. Then you could use ndiff() directly, in the way it's intended to be used.
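A rough sketch of the sentence-based approach follows. The splitter is an assumption: it naively breaks after ".", "!" or "?" followed by whitespace, so text where sentences run together without a space (as in parts of your first passage) would need smarter splitting. `split_sentences`, `sentence_diff` and the sample text are hypothetical, and the sentences are left without trailing newlines because they are printed one per line.

```python
import difflib
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_diff(a, b):
    # ndiff() now receives lists of sentences, which is the shape it expects.
    return list(difflib.ndiff(split_sentences(a), split_sentences(b)))


old = "An extra leading sentence. The shared sentence stays. Another shared one."
new = "The shared sentence stays. Another shared one."

result = sentence_diff(old, new)
for line in result:
    print(line)
```

Sentences that differ only slightly would show up as "-" and "+" pairs, possibly with "?" hint lines pointing at the changed characters.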
To make the other way (working with SequenceMatcher directly) more concrete, this code:
import difflib

# autojunk=False switches off the heuristic that treats very frequent
# items as junk when the second sequence has 200 or more items.
d = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
for op, i1, i2, j1, j2 in d.get_opcodes():
    print(">>> ", end="")
    if op == "equal":
        print(f"{i2-i1} characters the same at",
              f"{i1}:{i2} and {j1}:{j2}")
        print(s1[i1:i2])
    elif op == "delete":
        print(f"delete {i2-i1} characters at {i1}:{i2}")
        print(s1[i1:i2])
    elif op == "insert":
        print(f"insert {j2-j1} characters from {j1}:{j2}")
        print(s2[j1:j2])
    elif op == "replace":
        print(f"replace {i1}:{i2} with {j1}:{j2}")
        print(s1[i1:i2])
        print(s2[j1:j2])
    else:
        # get_opcodes() only yields the four tags handled above
        assert False, ("unknown op", repr(op))
produces this output:
>>> delete 194 characters at 0:194
Ableton Live also supports Audio To MIDI, which converts audio samples into a sequence of MIDI notes using three different conversion methods including conversion to Melody, Harmony, or Rhythm.
>>> 252 characters the same at 194:446 and 0:252
Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes.
>>> insert 1 characters from 252:253
>>> 26 characters the same at 446:472 and 253:279
[14] See Fourier transform
>>> insert 1 characters from 279:280
>>> 1 characters the same at 472:473 and 280:281
.
>>> insert 1 characters from 281:282
>>> 10 characters the same at 473:483 and 282:292
Envelopes[
>>> insert 1 characters from 292:293
>>> 4 characters the same at 483:487 and 293:297
edit
>>> insert 1 characters from 297:298
>>> 1 characters the same at 487:488 and 298:299
]
>>> insert 1 characters from 299:300
>>> 655 characters the same at 488:1143 and 300:955
Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them.
>>> insert 1 characters from 955:956
>>> 15 characters the same at 1143:1158 and 956:971
User interface[
>>> insert 1 characters from 971:972
>>> 4 characters the same at 1158:1162 and 972:976
edit
>>> insert 1 characters from 976:977
>>> 1 characters the same at 1162:1163 and 977:978
]
>>> insert 1 characters from 978:979
>>> 106 characters the same at 1163:1269 and 979:1085
Much of Live’s interface comes from being designed for use in live performance, as well as for production.
>>> delete 239 characters at 1269:1508
[15] There are few pop up messages or dialogs. Portions of the interface are hidden and shown based on arrows which may be clicked to show or hide a certain segment (e.g. to hide the instrument/effect list or to show or hide the help box).
You can edit that template to display results in any way you like best.
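As one possible variation on that template, here is a sketch that renders the whole diff as a single marked-up string. The `[-...-]` and `{+...+}` markers are an arbitrary choice, and `inline_diff` is a hypothetical helper name, not something difflib provides.

```python
import difflib


def inline_diff(a, b):
    """Return one string with deletions as [-x-] and insertions as {+x+}."""
    sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            out.append(a[i1:i2])
        if op in ("delete", "replace"):
            # Text present only in a.
            out.append(f"[-{a[i1:i2]}-]")
        if op in ("insert", "replace"):
            # Text present only in b.
            out.append(f"{{+{b[j1:j2]}+}}")
    return "".join(out)


marked = inline_diff("abc[edit]", "abc[ edit ]")
print(marked)
```

This keeps single-character insertions such as the extra spaces around "edit" visible, which the line-per-opcode output tends to hide.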