java, html, xml, xpath, xhtml

How to get the XPath of an element in HTML in Java?


I want to achieve a simple task, but I'm struggling to find an easy solution: I have the HTML of a webpage in a String (or File) and I'd like to generate the XPath of a given element (for example, the XPath of an <a> element).

I tried different solutions, but I keep running into problems parsing the HTML correctly. Is there a working HTML cleaner for Java like this one? https://www.htmlwasher.com/ This is the ONLY working cleaner I've found so far, but it is an online tool. With its output I can easily parse the HTML and get the XPath.

I'm currently using jOOX (https://github.com/jOOQ/jOOX) this way to generate the XPath:

    Document document = $(html).document();
    System.out.println($(document).find("a").xpath());

If the HTML is first cleaned with the online tool above, jOOX generates the right XPath. I like working with jOOX; I just need a way to parse the HTML correctly and programmatically. Do you know a good way to parse the HTML? I have already tried:

  • JSoup
  • Tagsoup
  • HtmlCleaner

The page I am testing with is http://www.ansa.it.

EDIT: Parsing was failing on common HTML problems such as unclosed tags (a missing </img>, for example), unescaped characters, etc.

I managed to parse the HTML "correctly" this way:

    Document doc = Jsoup.parse(Jsoup.clean(html, Whitelist.relaxed()));
    doc.outputSettings().escapeMode(EscapeMode.xhtml)
                        .syntax(Syntax.xml)
                        .charset(StandardCharsets.UTF_8);

The problem is that tags like <a href="cinema.shtml">Cinema</a> become <a>Cinema</a>, so I can no longer select them by their attributes, such as href. How can I solve this new problem?

I noticed that some links still have their href: the ones that point to other websites, like Facebook or Twitter. Could this be related?


Solution

  • SOLVED:

    I got everything working this way:

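    import static org.joox.JOOX.$;
    import java.nio.charset.StandardCharsets;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Document.OutputSettings.Syntax;
    import org.jsoup.nodes.Entities.EscapeMode;
    import org.jsoup.safety.Whitelist;

    String html = getTheHTMLSomeWay();
    // A base URI plus preserveRelativeLinks(true) keeps relative hrefs through cleaning.
    Document doc = Jsoup.parse(Jsoup.clean(html, "http://base.uri", Whitelist.relaxed().preserveRelativeLinks(true)));
    // Emit XML syntax with XHTML escaping so that an XML parser accepts the output.
    doc.outputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).charset(StandardCharsets.UTF_8);

    org.w3c.dom.Document document = $(doc.html()).document();
    System.out.println($(document).find("a[href='/your/relative/url']"));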
    

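    With Jsoup I can clean up the HTML: unclosed tags, disallowed tags, and so on. Passing a base URI together with preserveRelativeLinks(true) keeps relative href attributes, which Jsoup.clean otherwise drops because it cannot check a relative link's protocol against the whitelist. That is why only the absolute links to other websites kept their href. Then I escape all unescaped characters (according to XHTML) and set the output syntax to XML.

    That gives you clean HTML that is usable with the jOOX library.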
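
For readers who want to see what "the XPath of an element" amounts to once the markup is well-formed, here is a minimal sketch that uses only the JDK's DOM and XPath APIs. It is not jOOX's implementation, only something similar in spirit to its xpath() call. The class name ElementXPath and the helpers parse, selectFirst and xpathOf are hypothetical names made up for this illustration, and the sample markup stands in for the cleaned XHTML that Jsoup would produce.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
public class ElementXPath {

    /** Parses well-formed XML/XHTML text into a DOM document. */
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Returns the first element matched by an XPath expression, or null if none matches. */
    public static Element selectFirst(Document doc, String expression) {
        try {
            return (Element) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    /** Builds an absolute, positional XPath by walking from the element up to the root. */
    public static String xpathOf(Element element) {
        StringBuilder path = new StringBuilder();
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            // XPath positions are 1-based and count only same-named element siblings.
            int index = 1;
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s instanceof Element && s.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + n.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Assumed sample input: already cleaned, well-formed XHTML.
        String xhtml = "<html><body>"
                + "<a href=\"http://example.com\">Out</a>"
                + "<a href=\"/your/relative/url\">In</a>"
                + "</body></html>";
        Document doc = parse(xhtml);
        Element link = selectFirst(doc, "//a[@href='/your/relative/url']");
        System.out.println(xpathOf(link)); // prints /html[1]/body[1]/a[2]
    }
}
```

The sketch selects the link by its href attribute, which only works because the attribute survived cleaning, and then derives a positional path for it. It assumes the input has no XML namespace declaration; with a namespaced XHTML document the selection expression would need adjusting.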