
Compare two websites and see if they are "equal"?


We are migrating web servers and would like an automated way to check that the basic site structure and the rendered pages on the new server match those on the old server. Does anyone know of a tool that can help with this?


Solution

  • Dump the rendered plain-text output of both sites (here we use w3m -dump, but lynx -dump also works):

    w3m -dump http://google.com 2>/dev/null > /tmp/1.html
    w3m -dump http://google.de 2>/dev/null > /tmp/2.html
    

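    Then use wdiff, which compares the two files word by word. With -s it prints statistics that show, as a percentage, how similar the two texts are (-n keeps change markers from spanning line breaks, -i ignores case).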

    wdiff -nis /tmp/1.html /tmp/2.html
    

    The differences can be easier to see if you pipe the output through colordiff:

    wdiff -nis /tmp/1.html /tmp/2.html | colordiff
    

    Excerpt of output:

    Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
    [-iGoogle |-]
    Paramètres | Connexion
    
                               Google [hp1] [hp2]
                                      [hp3] [-Français-] {+Deutschland+}
    
               [                                                         ] Recherche
                                                                           avancéeOutils
                          [Recherche Google][J'ai de la chance]            linguistiques
    
    
    /tmp/1.html: 43 words  39 90% common  3 6% deleted  1 2% changed
    /tmp/2.html: 49 words  39 79% common  9 18% inserted  1 2% changed
    

    (Google actually served google.com in French here, which is why the output is in French. Funny.)

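    The "common" percentages show how similar the two texts are: in the excerpt above, 90% of the words in /tmp/1.html and 79% of the words in /tmp/2.html are common to both. You can also easily see the differences word by word, rather than line by line, which can get cluttered.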
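
For a migration you will probably want to repeat this check for several pages. The following is a minimal sketch of how the steps above could be wrapped in shell functions, not a finished tool. The function names (common_pct, compare_page), the host names and the page paths are hypothetical placeholders, and it assumes w3m, wdiff and mktemp are installed. It uses wdiff's -1, -2 and -3 options to suppress the word output so that only the statistics remain.

```shell
#!/bin/sh
# Sketch only: function names, hosts and paths are illustrative assumptions.

# Read `wdiff -s` statistics lines on stdin and print the "common" percentage
# of each, e.g. "... 43 words  39 90% common  3 6% deleted ..." gives 90.
common_pct() {
    sed -n 's/.* \([0-9][0-9]*\)% common.*/\1/p'
}

# Dump the same page from the old and the new host, then report how much of
# the old page's text is also present on the new one.
compare_page() {
    old_host=$1
    new_host=$2
    page_path=$3
    old_txt=$(mktemp)
    new_txt=$(mktemp)

    w3m -dump "http://$old_host$page_path" > "$old_txt" 2>/dev/null
    w3m -dump "http://$new_host$page_path" > "$new_txt" 2>/dev/null

    # -1 -2 -3 hide deleted, inserted and common words; -s prints statistics.
    # The first statistics line describes the first file (the old page).
    pct=$(wdiff -123is "$old_txt" "$new_txt" 2>&1 | common_pct | head -n 1)

    rm -f "$old_txt" "$new_txt"
    echo "$page_path ${pct:-?}% common"
}

# Example use (hypothetical hosts and paths):
#   for p in / /about.html /contact.html; do
#       compare_page old.example.com new.example.com "$p"
#   done
```

Pages that report less than 100% common are the ones worth inspecting by hand with the wdiff and colordiff commands shown above. Dynamic content such as dates or session tokens will lower the percentage even when the migration is fine, so treat the number as a hint rather than a verdict.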