python nlp difflib

Text comparison to ignore some newline characters in Python


I have a Python script that compares two texts and highlights the differences between them. However, the comparison is being affected by newline characters, causing mismatches for texts with different newline representations. For instance, "arti\ncle" and "article" are being treated as different.

I'm currently using the difflib module.

Here's a simplified version of my current code:

import difflib

def compare_texts(old_text, new_text):
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    
    d = difflib.Differ()
    diff = d.compare(old_lines, new_lines)
    
    added_lines = []
    deleted_lines = []
    
    for line in diff:
        if line.startswith('+ '):
            added_lines.append(line[2:])
        elif line.startswith('- '):
            deleted_lines.append(line[2:])
    
    return added_lines, deleted_lines

if __name__ == "__main__":
    old_text = "arti\ncle\nthis is some old text."
    new_text = "article\nthis is some new text."
    
    added_lines, deleted_lines = compare_texts(old_text, new_text)
    
    print("Added lines:")
    print('\n'.join(added_lines))
    
    print("\nDeleted lines:")
    print('\n'.join(deleted_lines))

Can someone suggest an effective way to compare texts that will handle newline characters appropriately, ensuring that "arti\ncle" and "article" are treated as the same during the comparison process?

EDIT1: In fact, many of the "\n" characters are introduced by a PDF reading function. The idea may be the following: if there is a "\n", we can try to delete it. If we get a match after deleting it, we can consider the two texts the same.

So "article" and "arti\ncle" are the same. "article" and "arti\nficial" are not.

I can't remove all "\n" because many of them are still useful.

EDIT2: Knowing the origin of the bug, we could also try another approach. Some random "\n" characters were added by the PDF reading function, so we can try to delete those meaningless "\n" characters first.
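The idea from EDIT1 and EDIT2 can be sketched roughly as follows. This is only a minimal sketch under some assumptions: the names `repair_linebreaks`, `damaged` and `reference` are made up for illustration, words are taken to be whitespace-separated tokens (so "article." and "article" count as different words), and a word is assumed to contain at most one stray "\n". A newline is removed only when joining the fragments on both sides of it gives a word that occurs in the other text.

```python
import re


def repair_linebreaks(damaged, reference):
    """Drop a newline in `damaged` only if joining the two word fragments
    around it produces a word that appears in `reference`."""
    # Assumption: a "word" is any whitespace-separated token of the reference.
    vocabulary = set(reference.split())

    lines = damaged.split("\n")
    result = lines[0]
    for line in lines[1:]:
        tail = re.search(r"\S+$", result)  # fragment just before the newline
        head = re.match(r"\S+", line)      # fragment just after the newline
        if tail and head and tail.group() + head.group() in vocabulary:
            result += line                 # stray newline: join the fragments
        else:
            result += "\n" + line          # meaningful newline: keep it
    return result
```

With the texts from the question, `repair_linebreaks(old_text, new_text)` would turn "arti\ncle" into "article" and keep the newline before "this", while "arti\nficial" stays untouched when the other text only contains "article". If either text may contain stray newlines, the function can be applied in both directions before diffing.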


Solution

  • Here's a suggested solution:

    • I think you need to do a wordwise diff, not a linewise one, so replace spaces with linebreaks before diffing (or use a different diffing method).
    • Then check whether two consecutively deleted or inserted lines, joined together, match a neighboring inserted or deleted line:
    import difflib
    
    def compare_texts(old_text, new_text):
        old_lines = old_text.splitlines()
        new_lines = new_text.splitlines()
        
        d = difflib.Differ()
        diff = d.compare(old_lines, new_lines)
        
        added_lines = []
        deleted_lines = []
        
        prev = None
        prev_prev = None
        for line in diff:
            if line.startswith('+ '):
                added_lines.append(line[2:])
            elif line.startswith('- '):
                deleted_lines.append(line[2:])
            if prev is not None and prev_prev is not None:
                # Handle the "+ - -" pattern: one added line followed by two deleted lines.
                if prev_prev.startswith('+ ') and prev.startswith('- ') and line.startswith('- '):
                    joined = prev[2:] + line[2:]
                    if joined == prev_prev[2:]:
                        # The two deleted fragments make up the added word, so drop all three entries.
                        del added_lines[-1]
                        del deleted_lines[-1]
                        del deleted_lines[-1]
                # The patterns "- - +", "+ + -" and "- + +" still need to be handled in the same way.
            prev_prev = prev
            prev = line
        
        return added_lines, deleted_lines
    
    if __name__ == "__main__":
        old_text = "arti\ncle\nthis is some old text."
        new_text = "article\nthis is some new text."
        
        added_lines, deleted_lines = compare_texts(old_text.replace(" ", "\n"), new_text.replace(" ", "\n"))
        
        print("Added lines:")
        print('\n'.join(added_lines))
        
        print("\nDeleted lines:")
        print('\n'.join(deleted_lines))
    

    You still need to handle the other cases; I only implemented + - -.

    This solution assumes that a word contains at most one linebreak. Also, all the 'good' linebreaks are lost.
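One possible way to cover all four orderings at once, and more than one linebreak per word, is to work on the opcodes of `difflib.SequenceMatcher` instead of the `Differ` output. The following is only a sketch under assumptions, not a tested replacement for the code above: `compare_words` is a hypothetical name, the texts are split on any whitespace (so a word broken by a space is treated like a word broken by a "\n", and the 'good' linebreaks are still lost), and a changed block is ignored when its old and new words concatenate to the same string.

```python
import difflib


def compare_words(old_text, new_text):
    """Word-level diff that ignores changes caused only by words being split."""
    old_words = old_text.split()
    new_words = new_text.split()

    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    added, deleted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old_chunk = old_words[i1:i2]
        new_chunk = new_words[j1:j2]
        # If both sides spell the same string once joined, the only
        # difference is where the word was broken, so skip the block.
        if "".join(old_chunk) == "".join(new_chunk):
            continue
        deleted.extend(old_chunk)
        added.extend(new_chunk)
    return added, deleted
```

For the texts in the question, this reports only "new" as added and "old" as deleted. One limitation to keep in mind: if a split word sits directly next to a real change, both end up in the same changed block, and the whole block is reported.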