Tags: html, xml, parsing, jtidy

JTidy node processing


I'm using JTidy in order to parse web page data. My question is the following:

Is it possible to call the XPath.evaluate method on a previously retrieved node?

To explain: normally you call xmlPath.evaluate(pattern, document, XPathConstants.NODE) to retrieve the node matching your XPath expression (or pass XPathConstants.NODESET to get a list of matching nodes).

Once I've retrieved a node or NodeList, how can I call xmlPath.evaluate starting from that previously retrieved node, something like xmlPath.evaluate(pattern, node, XPathConstants.NODE)?


Solution

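  • Yes, it is possible. evaluate accepts any Node as its context item, not only a Document, so a relative expression can be evaluated against a node you retrieved earlier: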

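    import java.net.URL;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.tidy.Tidy;

    // the statements below belong inside a method declared with "throws Exception"
    URL url = new URL("http://www.w3.org");

    // configure JTidy to emit XHTML/XML and suppress diagnostics
    Tidy tidy = new Tidy();
    tidy.setXHTML(true);
    tidy.setQuiet(true);
    tidy.setXmlOut(true);
    tidy.setShowWarnings(false);

    // parse the page into a W3C DOM document
    Document doc = tidy.parseDOM(url.openConnection().getInputStream(), null);
    XPath xpath = XPathFactory.newInstance().newXPath();

    // absolute XPath, evaluated against the whole document
    XPathExpression expr =
        xpath.compile("//form[@action = 'http://www.w3.org/Help/search']");
    Node form = (Node) expr.evaluate(doc, XPathConstants.NODE);

    // relative XPath, evaluated against the previously retrieved node
    expr = xpath.compile("ul/li[@class = 'last-item']/a");
    Node lastItem = (Node) expr.evaluate(form, XPathConstants.NODE);

    // print the text of the matched <a> element
    System.out.println(lastItem.getFirstChild().getNodeValue());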
    

    Output:

    About W3C
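
The question also mentions node lists, so here is a minimal sketch of the same idea using XPathConstants.NODESET and only the JDK's own parser. The markup in SAMPLE and the names RelativeXPathDemo, parse, linkTexts and countLinks are made up for illustration and stand in for a document produced by JTidy. It also shows one pitfall worth knowing: an expression that starts with // is resolved from the root of the document even when a context node is passed, so use .// to stay below the node.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathDemo {

    // Hypothetical sample markup standing in for a page parsed by JTidy.
    public static final String SAMPLE =
        "<html><body>"
        + "<form action='/search'><ul>"
        + "<li><a>Home</a></li><li class='last-item'><a>About</a></li>"
        + "</ul></form>"
        + "<ul><li><a>Elsewhere</a></li></ul>"
        + "</body></html>";

    // Parse an XML string into a DOM document with the JDK parser.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Collect the link text of every list item inside the form.
    public static List<String> linkTexts(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node form = (Node) xpath.evaluate("//form", doc, XPathConstants.NODE);

        // Relative expression: "ul/li" is resolved against the form node.
        NodeList items =
            (NodeList) xpath.evaluate("ul/li", form, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            // Each node of the list can serve as a context item in turn.
            texts.add(xpath.evaluate("a", items.item(i)));
        }
        return texts;
    }

    // Count the nodes matched by a pattern, using the form as context node.
    public static int countLinks(Document doc, String pattern) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node form = (Node) xpath.evaluate("//form", doc, XPathConstants.NODE);
        NodeList links =
            (NodeList) xpath.evaluate(pattern, form, XPathConstants.NODESET);
        return links.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(linkTexts(doc));          // [Home, About]
        System.out.println(countLinks(doc, ".//a")); // 2: links inside the form
        System.out.println(countLinks(doc, "//a"));  // 3: every link in the document
    }
}
```

With a JTidy document the calls are the same; only the way doc is obtained differs.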