javascript dom phantomjs casperjs dom-traversal

How to traverse the DOM tree of a website and get all the elements in CasperJS?


I'm new to web development, and my task is to find all the elements on a webpage (for example, all the elements on Amazon, including the header, footer, navbar, etc.) and then get the location and size of each one (height, width, top, bottom, left, right, etc.). I tried to use CasperJS and PhantomJS to do it, and here is my code:

casper.start('http://www.amazon.com/s?url=search-alias=aps&field-keywords=watches', function(){
});

var objarr = [];

casper.then(function(){
  var tmp = this.evaluate(function() {
    return document.getElementsByTagName("html")[0]; //get the html element and traverse all its children
  });
  traverseDOMTree(tmp);

  for (var i = 0; i < objarr.length; i++){
        var isvalid = judge(objarr[i]); //judge whether the element is null.
        console.log(i+1);
        if (isvalid && i != 0) {
          console.log(objarr[i].textContent);
        }
  }
});

function traverseDOMTree(root) //traverse function
{
  if (root)
  {
    for (var i = 0; i < root.childNodes.length; i++){
        objarr.push(root.childNodes[i]);
        traverseDOMTree(root.childNodes[i]);
    }
  }
}

function judge(obj){ 
  if (obj == null) { 
    console.log("The object is NULL");
    return false;
  }
  //If it is not null, get its location and height with width
  console.log("___________________________");
  console.log("The offsetTop is ", obj.offsetTop);
  console.log("The offsetLeft is ", obj.offsetLeft);
  console.log("The height is", obj.clientHeight);
  console.log("The width is", obj.clientWidth);
}

So my method is to first get the root of the DOM tree, which is document.getElementsByTagName("html")[0]. Then I traverse all of its children and put every element I find into an array. However, there are several problems here:

  1. Most of the elements I find are null objects.
  2. The traverse function seems to work only on the same level and does not continue down the tree.
  3. CasperJS does not seem stable, as I get different problems/warnings each time I run the script.

I've debugged and tried different approaches for a long time, but I still can't get it to work. I guess I need to put my traverse function inside casper.evaluate(), but there are too few tutorials on the web about how to use it. Can anyone help me find a feasible way to do this?


Solution

  • CasperJS is built on top of PhantomJS and inherits some of its shortcomings, like the two distinct contexts. You can only access the DOM (page context) through the sandboxed casper.evaluate() function. That function cannot use variables that are defined outside of it, and everything that you pass in or out has to be a primitive. DOM nodes are not primitives. See the docs (page.evaluate()):

    Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

    Closures, functions, DOM nodes, etc. will not work!

    This means that you have to do everything inside of the page context, because you're directly working on those DOM nodes. You can pass the results out of the page context when you're done traversing.

    Or you can simply move everything inside the page context and listen to the "remote.message" event, which forwards console.log() calls from the page context:

    casper.on("remote.message", function(msg){
        this.echo("remote> " + msg);
    });
    
    casper.then(function(){
        this.evaluate(function() {
            var tmp = document.getElementsByTagName("html")[0]; //get the html element and traverse all its children
            var objarr = [];
    
            traverseDOMTree(tmp);
    
            for (var i = 0; i < objarr.length; i++){
                var isvalid = judge(objarr[i]); //judge whether the element is null.
                console.log(i+1);
                if (isvalid && i != 0) {
                  console.log(objarr[i].textContent);
                }
            }
    
            function traverseDOMTree(root) //traverse function
            {
                if (root)
                {
                    for (var i = 0; i < root.childNodes.length; i++){
                        objarr.push(root.childNodes[i]);
                        traverseDOMTree(root.childNodes[i]);
                    }
                }
            }
    
            function judge(obj){ 
                if (obj == null) { 
                    console.log("The object is NULL");
                    return false;
                }
                //If it is not null, get its location and height with width
                console.log("___________________________");
                console.log("The offsetTop is ", obj.offsetTop);
                console.log("The offsetLeft is ", obj.offsetLeft);
                console.log("The height is", obj.clientHeight);
                console.log("The width is", obj.clientWidth);
                return true;
            }
        });
    });
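
The first option, doing the traversal in the page context and passing only the results out, could look roughly like the sketch below. It is not taken from the code above: collectBoxes is a hypothetical helper name, getBoundingClientRect() and children are used in place of offsetTop/clientHeight and childNodes so that text nodes are skipped, and it assumes casper.start() has already been called as in the question. The function is handed to evaluate() directly, so it must not reference anything defined outside itself.

```javascript
// Hypothetical helper: runs entirely in the page context and returns
// only plain, JSON-serializable objects (no DOM nodes).
function collectBoxes(root) {
    var boxes = [];

    function walk(el) {
        var r = el.getBoundingClientRect();
        boxes.push({
            tag: el.tagName,
            top: r.top, left: r.left, bottom: r.bottom, right: r.right,
            width: r.width, height: r.height
        });
        // `children` holds element nodes only; fall back to an empty
        // list in case an element type does not expose it.
        var kids = el.children || [];
        for (var i = 0; i < kids.length; i++) {
            walk(kids[i]);
        }
    }

    // With no argument (as when called through evaluate) start at <html>.
    walk(root || document.documentElement);
    return boxes;
}

// Only runs inside a CasperJS script; skipped elsewhere.
if (typeof casper !== "undefined") {
    casper.then(function () {
        var boxes = this.evaluate(collectBoxes);
        this.echo(JSON.stringify(boxes, null, 2));
    });
}
```

Each entry is a plain object, so the array survives the trip out of the page context and can be filtered or saved in the CasperJS script. The coordinates are relative to the viewport; add window.pageXOffset and window.pageYOffset inside the function if document coordinates are needed.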