Tags: java, spring-boot, xpath, jsoup, html-parsing

Delete element from HTML by raw xpath using jsoup or any other library


I am trying to delete an element from an HTML document using a raw XPath expression.

        final Document document = Jsoup.parse(htmlAsString);
        final Elements elements = document.select("/html/head");
        elements.forEach(Node::remove);

But I get the following error:

org.jsoup.select.Selector$SelectorParseException: Could not parse query '/html/head': unexpected token at '/html/head'
at org.jsoup.select.QueryParser.findElements(QueryParser.java:206)
at org.jsoup.select.QueryParser.parse(QueryParser.java:59)
at org.jsoup.select.QueryParser.parse(QueryParser.java:42)
at org.jsoup.select.Selector.select(Selector.java:91)
at org.jsoup.nodes.Element.select(Element.java:372)

Is there a way to evaluate a raw XPath expression against HTML in order to get or delete an element?


Solution

  • jsoup's select() natively supports a set of CSS selectors, not XPath, which is why the query '/html/head' fails to parse. You could just do this:

    Document doc = Jsoup.parse(html);
    doc.select("html > head").remove();
    

    (See the Selector syntax and Elements#remove() documentation.)

    If you need to use XPath specifically (why?), you can use jsoup's W3CDom converter to turn a jsoup Document into a W3C Document (the standard Java XML DOM), and run XPath queries against that:

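    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathFactory;
    import org.jsoup.Jsoup;
    import org.jsoup.helper.W3CDom;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    ...
    
    org.jsoup.nodes.Document jdoc = Jsoup.parse(html);
    Document w3doc = W3CDom.convert(jdoc);
    
    String query = "/html/head";
    XPathExpression xpath = XPathFactory.newInstance().newXPath().compile(query);
    Node head = (Node) xpath.evaluate(w3doc, XPathConstants.NODE);
    // evaluate() returns null when nothing matches; otherwise detach the node from its parent
    if (head != null) {
        head.getParentNode().removeChild(head);
    }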
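
    Once you have a W3C Document, deleting by XPath comes down to evaluating the query and detaching each matched node from its parent. The following is a minimal sketch of that step using only the JDK, so it runs without jsoup on the classpath. It assumes the input is already well-formed markup and parses it with DocumentBuilder in place of W3CDom.convert; with real-world HTML you would still go through jsoup first. The class name XPathRemoveSketch and the helper removeByXPath are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathRemoveSketch {

    // Hypothetical helper: removes every node matched by `query`
    // and returns the remaining document serialized as a string.
    // Assumption: `wellFormedMarkup` parses as XML (no unclosed tags).
    public static String removeByXPath(String wellFormedMarkup, String query) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));

        // NODESET returns all matches, not just the first one.
        NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(query, doc, XPathConstants.NODESET);

        for (int i = 0; i < matches.getLength(); i++) {
            Node node = matches.item(i);
            Node parent = node.getParentNode();
            // Attribute nodes have no parent node; skip them in this sketch.
            if (parent != null) {
                parent.removeChild(node);
            }
        }

        // Serialize with the XML output method and without an XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><head><title>t</title></head><body><p>kept</p></body></html>";
        System.out.println(removeByXPath(markup, "/html/head"));
    }
}
```

    Using XPathConstants.NODESET rather than NODE means the same helper also handles queries that match several elements, such as "//script".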