
jsoup mistakes a token as an HTML tag


I have an HTML fragment as follows:

<span class="article-title">About《About<SomeChineseChars>》Blabla</span>

(Sorry, I use Latin characters here because the editor does not let me type Chinese characters; SomeChineseChars stands for a run of Chinese characters.)

When I try to extract the text of this element using

doc.select(".article-title").text();

I get the following result:

About《About》Blabla 

After debugging the program, I found that

<SomeChineseChars> 

was treated as an HTML tag, and jsoup closed the tag automatically, as follows:

<SomeChineseChars></SomeChineseChars> 

So, is there any way to prevent this from happening, or is this a bug?

-=-=-= UPDATE =-=-=-

After the DOM is built, I checked the parsed HTML. I cannot post images, so the output is in a linked screenshot.

Thanks a lot, Ben


Solution

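  • I came up with a solution by hacking into jsoup as follows:

    1. Create a package named org.jsoup.parser in your own project, so that the new class can reach jsoup's package-private parser classes such as Token.
    2. Add a customized HtmlTreeBuilder to that package: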

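      package org.jsoup.parser;

      import java.io.Reader;

      import org.jsoup.nodes.Document;
      import org.jsoup.nodes.Element;

      // Lives in org.jsoup.parser so it can use jsoup's package-private parser classes.
      public class TroilaHtmlTreeBuilder extends HtmlTreeBuilder {

          // Matches a tag name made up entirely of Chinese characters (U+4E00 to U+9FA5).
          private final String zh = "[\\u4e00-\\u9fa5]+";

          public TroilaHtmlTreeBuilder() {
          }

          // Insert a "tag" whose name is all Chinese characters as plain text instead.
          @Override
          Element insert(Token.StartTag startTag) {
              if (startTag.tagName.matches(zh)) {
                  Token.Character ch = new Token.Character();
                  ch.data(startTag.toString());
                  insert(ch);
                  return null; // no element was created for this token
              }
              return super.insert(startTag);
          }

          // Parse with this tree builder, without error tracking, using the default settings.
          public Document parse(Reader input, String baseUri) {
              return super.parse(input, baseUri, ParseErrorList.noTracking(), this.defaultSettings());
          }
      }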
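
To make clear which tag names the override intercepts, here is a minimal standalone sketch of the same check, using only the JDK. The class name ZhTagNameCheck and the method isChineseTagName are made up for this illustration and are not part of jsoup. String.matches requires the whole string to match, so only names made up entirely of characters in the U+4E00 to U+9FA5 range are treated as text.

```java
// Hypothetical helper for illustration only; not part of jsoup.
final class ZhTagNameCheck {

    // Same expression as the zh field: one or more characters in U+4E00..U+9FA5.
    private static final String ZH = "[\\u4e00-\\u9fa5]+";

    // True only when the whole tag name consists of characters in that range.
    static boolean isChineseTagName(String tagName) {
        return tagName.matches(ZH);
    }

    public static void main(String[] args) {
        System.out.println(isChineseTagName("\u4e2d\u6587")); // two Chinese characters: true
        System.out.println(isChineseTagName("span"));         // ordinary tag name: false
        System.out.println(isChineseTagName("\u4e2da"));      // mixed name: false
    }
}
```

A mixed name such as a Chinese character followed by a Latin letter does not match, so it would still be handled as a normal start tag.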
      

    I don't think this is a good way to solve the problem, so let me know if you have a better idea.
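
One possible alternative, sketched here as an assumption and not verified against jsoup itself, is to escape every `<` that is immediately followed by a Chinese character before the markup reaches the parser. The parser should then see the entity `&lt;` rather than the start of a tag. The class name ChineseAngleBracketEscaper and its escape method are hypothetical; the character range is the same one used in the zh pattern above.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper for illustration only; not part of jsoup.
final class ChineseAngleBracketEscaper {

    // A "<" directly followed by a character in U+4E00..U+9FA5.
    private static final Pattern LT_BEFORE_CHINESE =
            Pattern.compile("<(?=[\\u4e00-\\u9fa5])");

    // Replace each such "<" with the entity "&lt;"; everything else is left untouched.
    static String escape(String html) {
        return LT_BEFORE_CHINESE.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        String html = "<span class=\"article-title\">About<\u4e2d\u6587>Blabla</span>";
        // The span tags are unchanged; only the "<" before the Chinese text is escaped.
        System.out.println(escape(html));
    }
}
```

The escaped string would then be passed to the parser as usual. The trade-off is that this works on raw text, so it would also rewrite a `<` followed by a Chinese character inside a script or comment, which the tree-builder approach leaves alone.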

    BTW: many thanks to @Abhilash for your help!