
Mozilla Parser for screen scraping


I'm writing an app that takes in the HTML code of a page, extracts certain elements (such as tables), and returns the HTML code for those elements. I'm attempting to do this in Java, using the Mozilla parser to simplify navigating the page, but I'm having trouble extracting the HTML code I need.

Maybe my whole approach (i.e. using the Mozilla parser) is wrong, so if there are better solutions, I'm open to suggestions.

String html = // whatever the page's HTML is

MozillaParser p = // instantiate parser


// pass in html to parse which creates a dom object
Document d = p.parse(html);

// get a list of all the form elements in the page
NodeList l =  d.getElementsByTagName("form");

// iterate through all forms
for(int i = 0; i < l.getLength(); i++){

    // get a form
    Node n = l.item(i);

    // Print out the HTML code for just this form.
    // This is the part I haven't figured out.
    // innerHTML() is a method I made up, but it shows
    // the end result I want: a way to see the HTML
    // code for a particular node.
    System.out.println( n.innerHTML() );
}

Solution

  • The Mozilla parser seems like overkill here. I've used Jericho with some success for exactly the type of thing you are doing.
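
If the parser hands back a standard org.w3c.dom.Document, as the code in the question suggests, one parser-independent way to see the markup for a single node is to serialize that node with the JDK's identity Transformer. This is not Jericho's API and not something the Mozilla parser is known to provide; it is only a minimal sketch built on javax.xml classes. The class name NodeMarkup, the helper outerMarkup, and the sample markup are made up for illustration, and DocumentBuilder stands in for the Mozilla parser so the example runs on its own (which means the sample input has to be well-formed).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any parser library.
public class NodeMarkup {

    /** Serializes a DOM node, including its own tag, to a markup string. */
    public static String outerMarkup(Node node) throws Exception {
        // A Transformer created without a stylesheet copies input to output.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> header so only the node's markup remains.
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: well-formed sample markup, parsed by the JDK's XML
        // parser as a stand-in for the Mozilla parser in the question.
        String html = "<html><body><p>intro</p>"
                + "<form action=\"/go\"><input name=\"q\"/></form>"
                + "</body></html>";
        Document d = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));

        // Same loop as in the question, with the made-up innerHTML() call
        // replaced by the serializer helper.
        NodeList forms = d.getElementsByTagName("form");
        for (int i = 0; i < forms.getLength(); i++) {
            System.out.println(outerMarkup(forms.item(i)));
        }
    }
}
```

The helper only depends on the org.w3c.dom.Node interface, so in principle it should accept nodes from any parser that builds a standard DOM. The default output is XML-style, so empty elements come out self-closed; setting OutputKeys.METHOD to "html" is an option if HTML-style output is preferred.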