Tags: python, html, python-2.7, string-comparison, difflib

Comparing HTML with difflib


I'm looking to get reliable diffs of the content only (structural changes will be rare and can therefore be ignored) of this page. More specifically, the only change I need to pick up is a new Instruction ID being added.

To get a feel for what difflib will produce, I first diff two identical HTML contents, hoping to get nothing back:

import urllib  # Python 2.7; in Python 3 urlopen lives in urllib.request
import difflib

url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib.urlopen(url)
content = response.read()  # the whole page as a single string
d = difflib.Differ()

diffed = d.compare(content, content)

Since difflib mimics the UNIX diff utility, I would expect diffed to contain nothing (or to give some indication that the sequences were identical), yet if I '\n'.join diffed, I get something resembling HTML (although it doesn't render in a browser).

Indeed, if I take the simplest case possible of diffing two characters:

diffed = d.compare('a', 'a')

diffed.next() produces the following:

'  a'

So am I expecting something from difflib that it can't or won't provide (and should I change tack), or am I misusing it? What are viable alternatives for diffing HTML?


Solution

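  • The arguments to Differ.compare() are supposed to be sequences of strings, typically lists of lines. If you pass two plain strings, each is treated as a sequence of characters, so they are compared character by character. Note also that Differ reports every item, prefixing unchanged ones with two spaces, which is why comparing 'a' with 'a' yields '  a' rather than nothing.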

    So your example should be rewritten as:

    import urllib  # Python 2.7; in Python 3 urlopen lives in urllib.request
    import difflib

    url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
    response = urllib.urlopen(url)
    content = response.readlines()  # get response as list of lines
    d = difflib.Differ()
    
    diffed = d.compare(content, content)
    print(''.join(diffed))  # lines from readlines() already end with '\n'
    

    If you only want to compare the content of an HTML file, you should probably use a parser to process it and extract only the text without tags, e.g. by using BeautifulSoup's soup.stripped_strings:

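    import bs4

    # html_content is the page's HTML; list_to_compare_to holds the strings from the other version
    soup = bs4.BeautifulSoup(html_content, 'html.parser')  # name the parser explicitly
    diff = d.compare(list(soup.stripped_strings), list_to_compare_to)
    print('\n'.join(diff))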
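
Since the goal is only to notice newly added entries, one possible follow-up is to keep just the lines that Differ marks with '+ '. The sketch below is a minimal illustration under stated assumptions: old_ids and new_ids are hypothetical sample lists standing in for the strings extracted from two snapshots of the page, and added_lines is a helper name invented here, not part of difflib.

```python
import difflib

# Hypothetical sample data: stand-ins for the text strings extracted from
# two snapshots of the page (for example via soup.stripped_strings).
old_ids = ["ID-001", "ID-002", "ID-004"]
new_ids = ["ID-001", "ID-002", "ID-003", "ID-004"]


def added_lines(before, after):
    """Return the items that Differ marks as present only in `after`."""
    differ = difflib.Differ()
    return [line[2:] for line in differ.compare(before, after)
            if line.startswith("+ ")]


# Differ reports every item, so identical inputs still produce output:
# each line is simply prefixed with two spaces.
same = list(difflib.Differ().compare(old_ids, old_ids))

# unified_diff behaves more like the UNIX diff utility and yields nothing
# when the two sequences are identical.
no_changes = list(difflib.unified_diff(old_ids, old_ids, lineterm=""))

print(added_lines(old_ids, new_ids))
```

If output closer to the UNIX diff utility is preferred, difflib.unified_diff may be the better fit, because it produces no lines at all when the inputs match.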