
diff list of multiline strings with difflib without knowing which were added, deleted or modified


I have two lists of multiline strings and I am trying to get the diff lines for these strings. First I tried splitting every string into lines, treating all of those lines as one big "file" and diffing that, but I ran into a lot of bugs. I cannot simply diff by index, since I do not know which multiline string was added, which was deleted and which was modified.

Let's say I have the following example:

import difflib
oldList = ["one\ntwo\nthree","four\nfive\nsix","seven\neight\nnine"]
newList = ["four\nfifty\nsix","seven\neight\nnine","ten\neleven\ntwelve"]
oldAllTogether = []
for string in oldList:
    oldAllTogether.extend(string.splitlines())
newAllTogether = []
for string in newList:
    newAllTogether.extend(string.splitlines())
diff = difflib.unified_diff(oldAllTogether,newAllTogether)

So I somehow have to find out which strings belong to each other.
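One way to make "belong to each other" concrete is to score every old string against a new string with `difflib.SequenceMatcher.ratio()` and accept the best pair only above a threshold. The following is a minimal sketch under that assumption; the helper name `best_match` and the cutoff of 0.75 are illustrative choices, not part of difflib.

```python
import difflib

def best_match(new_block, old_blocks, cutoff=0.75):
    """Return the index of the most similar old block, or None if no
    old block reaches the (assumed) similarity cutoff."""
    best_index, best_ratio = None, 0.0
    for index, old_block in enumerate(old_blocks):
        # ratio() is a similarity score between 0.0 and 1.0
        ratio = difflib.SequenceMatcher(None, old_block, new_block).ratio()
        if ratio > best_ratio:
            best_index, best_ratio = index, ratio
    return best_index if best_ratio >= cutoff else None

old_blocks = ["one\ntwo\nthree", "four\nfive\nsix", "seven\neight\nnine"]
print(best_match("four\nfifty\nsix", old_blocks))     # index of the modified block
print(best_match("ten\neleven\ntwelve", old_blocks))  # None: nothing similar enough
```

With the example data, the first call returns 1 (the "four/five/six" string) and the second returns None, so the last new string would be treated as added.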


Solution

  • I had to implement my own code to get the desired output. It is basically the same as Differ.compare(), with the difference that it looks at multiline blocks instead of single lines. So the code would be:

    import difflib

    oldList = ["one\ntwo\nthree","four\nfive\nsix","seven\neight\nnine"]
    newList = ["four\nfifty\nsix","seven\neight\nnine","ten\neleven\ntwelve"]
    cutoff = 0.75  # minimum similarity for two strings to count as "modified"
    diffString = ""

    def prefixLines(strings, prefix):
        # Return every line of every string in `strings`, prefixed with "+" or "-".
        return "".join(prefix + line + "\n" for string in strings for line in string.splitlines())

    # Compare whole multiline strings first: each list element is one item.
    cruncher = difflib.SequenceMatcher(None, oldList, newList)
    for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
        if tag == 'replace':
            oldstrings = oldList[alo:ahi]
            for newstring in newList[blo:bhi]:
                # Find the most similar old string that is still unmatched.
                # A separate matcher keeps `cruncher` untouched, and
                # best_ratio is reset for every new string.
                matcher = difflib.SequenceMatcher(None, "", newstring)
                best_ratio, best_old = 0.0, None
                for i, oldstring in enumerate(oldstrings):
                    matcher.set_seq1(oldstring)
                    # cheap upper bounds first, exact ratio last
                    if matcher.real_quick_ratio() > best_ratio and \
                       matcher.quick_ratio() > best_ratio and \
                       matcher.ratio() > best_ratio:
                        best_ratio, best_old = matcher.ratio(), i
                if best_ratio < cutoff:
                    # added string
                    diffString += prefixLines([newstring], "+")
                else:
                    # modified string: line diff without the three header
                    # lines ("---", "+++" and the first "@@")
                    oldstring = oldstrings.pop(best_old)
                    lines = difflib.unified_diff(oldstring.splitlines(), newstring.splitlines(), lineterm="")
                    for line in list(lines)[3:]:
                        diffString += line + "\n"
            # old strings without a partner were deleted
            diffString += prefixLines(oldstrings, "-")
        elif tag == 'delete':
            diffString += prefixLines(oldList[alo:ahi], "-")
        elif tag == 'insert':
            diffString += prefixLines(newList[blo:bhi], "+")
        elif tag == 'equal':
            continue
        else:
            raise ValueError('unknown tag %r' % (tag,))
    

    which results in the following:

    print(diffString)
     four
    -five
    +fifty
     six
    -one
    -two
    -three
    +ten
    +eleven
    +twelve
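To see why the output is ordered this way, it can help to look at the block-level opcodes on their own. The sketch below only inspects what `SequenceMatcher.get_opcodes()` reports when each multiline string is treated as a single item; the variable names are illustrative.

```python
import difflib

old_blocks = ["one\ntwo\nthree", "four\nfive\nsix", "seven\neight\nnine"]
new_blocks = ["four\nfifty\nsix", "seven\neight\nnine", "ten\neleven\ntwelve"]

# Each list element (a whole multiline string) is compared as one item,
# so only exactly identical strings count as "equal".
matcher = difflib.SequenceMatcher(None, old_blocks, new_blocks)
opcodes = matcher.get_opcodes()
for tag, alo, ahi, blo, bhi in opcodes:
    print(tag, old_blocks[alo:ahi], new_blocks[blo:bhi])
```

For the example data this reports a 'replace' covering the first two old strings and the first new string, an 'equal' for the unchanged "seven/eight/nine" string, and an 'insert' for the last new string. Only the 'replace' span needs the similarity search, which is where the modified and deleted strings are told apart.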