I want to achieve a simple task, but I'm struggling to find an easy solution: I have the HTML of a web page in a String (or File) and I'd like to generate the XPath of a given element.
(For example, I'd like to retrieve the XPath of an <a> element.)
I tried different solutions, but I keep running into problems parsing the HTML correctly. Is there a working HTML cleaner for Java like this one? https://www.htmlwasher.com/ This is the ONLY working cleaner I've found so far, but it is an online tool. With its output I can easily parse the HTML and get the XPath.
I'm currently using jOOX (https://github.com/jOOQ/jOOX) this way to generate the XPath:
Document document = $(html).document();
System.out.println($(document).find("a").xpath());
If the HTML is cleaned with the online tool mentioned above, I can generate the correct XPath. I like working with jOOX, if only I could parse the HTML correctly and programmatically. Do you know a good way to parse the HTML? I already tried:
The page I'm testing with is http://www.ansa.it.
EDIT:
The parsing was failing on some common HTML problems, such as unclosed tags (</img> for example), escaping, etc.
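To make these failures concrete, here is a minimal sketch using only the JDK's own XML parser (not jOOX itself). It assumes that any library working on an org.w3c.dom.Document needs well-formed XML as input. The class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML with the JDK parser. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for unclosed tags, unescaped ampersands, and similar problems.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <img> is never closed, so an XML parser rejects it.
        System.out.println(isWellFormed("<p><img src=\"a.png\"></p>"));
        // The XHTML form of the same markup is accepted.
        System.out.println(isWellFormed("<p><img src=\"a.png\"/></p>"));
        // An unescaped '&' in an attribute value is rejected as well.
        System.out.println(isWellFormed("<a href=\"x?a=1&b=2\">x</a>"));
    }
}
```

Real pages contain plenty of markup like the first and third inputs, which is why a cleaning step is needed before the XML tooling can be used.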
I managed to parse the HTML "correctly" this way:
Document doc = Jsoup.parse(Jsoup.clean(html, Whitelist.relaxed()));
doc.outputSettings().escapeMode(EscapeMode.xhtml)
.syntax(Syntax.xml)
.charset(StandardCharsets.UTF_8);
The problem is that tags like <a href="cinema.shtml">Cinema</a> became <a>Cinema</a>, so I'm not able to select them by their attributes, such as href. How can I solve this new problem?
I noticed that some links still have their href and they are the ones which point to other websites like Facebook or Twitter. Could this be related?
SOLVED:
I managed to get all things to work this way:
String html = getTheHTMLSomeWay();
Document doc = Jsoup.parse(Jsoup.clean(html, "http://base.uri", Whitelist.relaxed().preserveRelativeLinks(true)));
doc.outputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).charset(StandardCharsets.UTF_8);
org.w3c.dom.Document document = $(doc.html()).document();
System.out.println($(document).find("a[href='/your/relative/url']"));
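With Jsoup I can clean the HTML of all those annoying unclosed tags, disallowed tags, etc. Then I can escape all the unescaped characters (according to XHTML) and set the output syntax to XML. The href attributes were being dropped because they were relative: the cleaner only keeps links whose protocol it can verify, so relative links survive only when a base URI is given together with preserveRelativeLinks(true). (In recent Jsoup versions, Whitelist has been renamed to Safelist.)
This gives you clean, well-formed HTML that can be used with the jOOX library.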
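For anyone who wants the XPath without an extra library once the markup is well-formed, here is a rough sketch using only the JDK's DOM API. It is not how jOOX computes its result, just a positional path built by walking up the parent chain. The names XPathSketch, parse, firstWithAttribute and xpathOf are hypothetical, and the sample markup is made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class XPathSketch {

    /** Parses already cleaned, well-formed XHTML into a W3C DOM document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Returns the first element with the given tag and attribute value, or null. */
    static Element firstWithAttribute(Document doc, String tag, String attr, String value) {
        NodeList nodes = doc.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element candidate = (Element) nodes.item(i);
            if (value.equals(candidate.getAttribute(attr))) {
                return candidate;
            }
        }
        return null;
    }

    /** Builds a positional XPath such as /html[1]/body[1]/p[2]/a[2]. */
    static String xpathOf(Element element) {
        StringBuilder path = new StringBuilder();
        // Walk up until the parent is the Document node (which is not an Element).
        for (Node node = element; node instanceof Element; node = node.getParentNode()) {
            int index = 1; // XPath positions are 1-based
            for (Node s = node.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s instanceof Element && s.getNodeName().equals(node.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + node.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>x</p>"
                + "<p><a href=\"/a\">A</a><a href=\"cinema.shtml\">Cinema</a></p>"
                + "</body></html>";
        Element link = firstWithAttribute(parse(xhtml), "a", "href", "cinema.shtml");
        System.out.println(xpathOf(link)); // prints /html[1]/body[1]/p[2]/a[2]
    }
}
```

The index counts only preceding siblings with the same tag name, so the resulting path stays valid as long as the document structure does not change.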