web page change detection

Currently i am doing my project/thesis for the last semester, and i thought of doing it on "detecting the webpage changes in web". I have read two paper on this topic but i have some confusions

1. in a paper entitled

An enhanced web page change detection algorithm with application to speeding up mobile web page transcoding 1

it is written

first generate subtrees from HTML documents, where each subtree is given a mark according to its tag contents.

My question is here how to generate the subtrees from the HTML documents ?? what is the technique for doing so. and the next question what is it saying by "giving a mark according to its tag contents".

2. please look at the image here!! General diagram of proposed approach

In “Calculate most similar sub-trees” box how matching is done?? in another paper which is entitled

An Efficient Web Page Change Detection System Based on an Optimized Hungarian Algorithm [2]

Hungarian algorithm is used for matching, a line is quoted from the paper entitled

A fast HTML web page change detection approach based on hashing and reducing the number of similarity computations [3]

the approach in [2] uses the O(N³)Hungarian algorithm to compute the maximum weighted matching on a weighted bipartite graph and has a running time in O(N₂ x N₁³) , where N₁ and N₂ are, respectively, the number of nodes in the old page and in the new (changed) page.” my question is, as the subtrees are forming why weights are being added, and how they are added ?

Thanks for reading my questions/confusions, i really need help here and little soon, please anyone help me with this one, i shall be always grateful.

Solution

I have a really easy and fast way of doing this I think.

I recently wrote and released jqgram for calculating tree edit distance approximation with an easy-to-use API for comparing DOM-like structures, JSON structures, or tree structures of your own design:

https://github.com/hoonto/jqgram

Based on original paper: http://www.vldb2005.org/program/paper/wed/p301-augsten.pdf Originally ported from Python implementation: https://github.com/Sycondaman/PyGram

The jqgram tree edit distance approximation module implements the PQ-Gram algorithm for both server-side and browser-side applications; O(n log n) time and O(n) space performant where n is the number of nodes. The PQ-Gram approximation is much faster than obtaining the true edit distance via Zhang & Shasha, Klein, Guha, or others, whom provide true edit distance algorithms that perform at best O(n^2) or O(n^3) depending on which algorithm you look at.

Here's a start on how I would use jqgram for your specific challenge which I took straight out of the README on github. To answer one of your questions, you can use the DOM as its own tree structure, in a library such as jQuery (as shown below) or replicate it or generate one from html string in Cheerio or its underlying HTML parsing library or any combination thereof (jqgram provides you this flexibility). The example here compares the DOM in the current page to a Cheerio representation generated from a string - your known reference.

// This could probably be optimized significantly, but is a real-world
// example of how to use tree edit distance in the browser.

// For cheerio, you'll have to browserify, 
// which requires some fiddling around
// due to cheerio's dynamically generated 
// require's (good grief) that browserify 
// does not see due to the static nature 
// of its code analysis (dynamic off-line
// analysis is hard, but doable).
//
// Ultimately, the goal is to end up with 
// something like this in the browser:
var cheerio = require('./lib/cheerio');

// But you could use jQuery for both sides of this comparison in which case your
// lfn and cfn callback functions become the same for both roots. 

// The easy part, jqgram:
var jq = require("../jqgram").jqgram;

// Make a cheerio DOM:
var html = '<body><div id="a"><div class="c d"><span>Irrelevent text</span></div></div></body>';

var cheeriodom = cheerio.load(html, {
    ignoreWhitespace: false,
    lowerCaseTags: true
});

// For ease, lets assume you have jQuery laoded:
var realdom = $('body');

// The lfn and cfn functions allow you to specify
// how labels and children should be defined:
jq.distance({
    root: cheeriodom,
    lfn: function(node){ 
        // We don't have to lowercase this because we already
        // asked cheerio to do that for us above (lowerCaseTags).
        return node.name; 
    },
    cfn: function(node){ 
        // Cheerio maintains attributes in the attribs array:
        // We're going to put id's and classes in as children 
        // of nodes in our cheerio tree
        var retarr = []; 
        if(!! node.attribs && !! node.attribs.class){
            retarr = retarr.concat(node.attribs.class.split(' '));
        }
        if(!! node.attribs && !! node.attribs.id){
            retarr.push(node.attribs.id);
        }
        retarr = retarr.concat(node.children);
        return  retarr;
    }
},{
    root: realdom,
    lfn: function(node){ 
        return node.nodeName.toLowerCase(); 
    },
    cfn: function(node){ 
        var retarr = [];
        if(!! node.attributes && !! node.attributes.class && !! node.attributes.class.nodeValue){
            retarr = retarr.concat(node.attributes.class.nodeValue.split(' '));
        }
        if(!! node.attributes && !! node.attributes.id && !! node.attributes.id.nodeValue) {
            retarr.push(node.attributes.id.nodeValue);
        }
        for(var i=0; i<node.children.length; ++i){
            retarr.push(node.children[i]);
        }
        return retarr;
    }
},{ p:2, q:3, depth:10 },
function(result) {
    console.log(result.distance);
});

Note that the lfn and cfn parameters specify how each tree should determine the node label names and the children array for each tree root independently so that you can compare your DOM to a JSON object or something else that uses different semantics for specifying what are children and what are node labels. Note also in this example that I utilize the DOM entity class attribute, splitting it into its individual classes and the id of the DOM node itself as immediate children of the node in order to provide more clarity on whether two trees are very similar or very different. You can extend that to include your own attributes. Or, you can also modify the lfn functions for each tree to include the id in the label like "tagname:id" - its up to you and will change how the algorithm performs - perhaps something interesting to investigate in your research.

So to summarize all you need to do is provide those lfn and cfn functions along with each root and jqgram will do the rest, calling the lfn and cfn functions to build out the trees.

The PQ-Gram algorithm implemented by jqgram will provide the edit distance as a number between zero and one and it should be noted that a value of zero does not necessarily indicate absolute equality, only that the two trees are very similar. If you need to go on to determine if two very similar trees as determined by jqgram are indeed identical you can use Zhang and Shasha, but using jqgram to get the metrics will save you a ton of computations which become extremely critical in client-side browser applications where end-user performance is obviously important.

Hope that helps!