I have an HTML fragment like this:
<span class="article-title">About《About<SomeChineseChars>》Blabla</span>
(Sorry, I use Latin characters here because the editor does not let me type Chinese characters.)
When I try to extract the text of this element using
doc.select(".article-title").text();
I end up with the following result:
About《About》Blabla
After debugging the program, I found that
<SomeChineseChars>
was treated as an HTML tag, and JSoup closed it automatically, like this:
<SomeChineseChars></SomeChineseChars>
So, is there any way to prevent this from happening, or is this a bug?
-=-=-= UPDATE =-=-=-
After the DOM is built, I checked the parsed HTML. The output is in a screenshot
(I cannot post images, so please click the link to view it).
Thanks a lot, Ben
I came up with a solution by hacking into JSoup as follows:
Customize an HtmlTreeBuilder:
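// Must live in jsoup's own package: Token and insert(Token.StartTag) are package-private.
package org.jsoup.parser;

import java.io.Reader;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TroilaHtmlTreeBuilder extends HtmlTreeBuilder {
    // One or more characters in the range U+4E00..U+9FA5 (Chinese characters)
    private String zh = "[\\u4e00-\\u9fa5]+";

    public TroilaHtmlTreeBuilder() {
    }

    @Override
    Element insert(Token.StartTag startTag) {
        // A "tag" whose name is purely Chinese is really text:
        // re-insert its source form as character data instead of an element.
        if (startTag.tagName.matches(zh)) {
            Token.Character ch = new Token.Character();
            ch.data(startTag.toString());
            insert(ch);
            return null;
        }
        return super.insert(startTag);
    }

    // Convenience overload: parse without error tracking, using the default settings.
    public Document parse(Reader input, String baseUri) {
        return super.parse(input, baseUri, ParseErrorList.noTracking(), this.defaultSettings());
    }
}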
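To make clear what the zh check in insert() accepts, here is a small standalone sketch that uses only java.util.regex. The class name ChineseTagNameCheck and the sample tag names are made up for illustration; only the character range is taken from the code above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: mirrors the zh check from the tree builder above.
final class ChineseTagNameCheck {
    // Same character range as the zh field: U+4E00..U+9FA5, one or more characters.
    private static final Pattern ZH = Pattern.compile("[\\u4e00-\\u9fa5]+");

    /** True only when the whole name consists of characters in that range. */
    static boolean isChineseOnly(String tagName) {
        return ZH.matcher(tagName).matches();
    }

    public static void main(String[] args) {
        System.out.println(isChineseOnly("\u4e2d\u6587")); // true: purely Chinese name
        System.out.println(isChineseOnly("span"));         // false: ordinary HTML tag
        System.out.println(isChineseOnly("a\u4e2d"));      // false: mixed Latin and Chinese
    }
}
```

Since matches() has to cover the whole string, a name that mixes Latin and Chinese characters is still handled as a normal tag, so the hack only covers the purely Chinese case.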
I don't think this is a good way to solve the problem, so let me know if you have a better idea.
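One alternative that avoids touching JSoup internals might be to escape the offending "<" before the HTML ever reaches the parser. The following is only a rough sketch under assumptions: the class name ChineseAngleEscaper and the sample input are made up, and it assumes that a "<" directly followed by a Chinese character is never meant to start a real tag.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing step, independent of jsoup.
final class ChineseAngleEscaper {
    // A "<" immediately followed by a character in U+4E00..U+9FA5 (lookahead, so the
    // Chinese character itself is not consumed).
    private static final Pattern LT_BEFORE_ZH = Pattern.compile("<(?=[\\u4e00-\\u9fa5])");

    /** Replaces each such "<" with the entity "&lt;" and leaves everything else as is. */
    static String escape(String html) {
        return LT_BEFORE_ZH.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        // Sample input with real Chinese characters written as Unicode escapes.
        String raw = "<span class=\"article-title\">About\u300a<\u4e2d\u6587>\u300bBlabla</span>";
        System.out.println(escape(raw));
    }
}
```

The escaped string can then be passed to Jsoup.parse as usual, and the Chinese part should be read as plain text rather than as a tag. Note that this blindly rewrites the whole input, so it would also change a matching "<" inside a script block or an attribute value.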
BTW: many thanks to @Abhilash for your help!