Tags: java, gwt, html-parsing, jericho-html-parser

jericho-html - text extraction returns incorrect text length


Today I tried to use the jericho-html-3.2 library to extract text from some simple HTML, and I ran into a strange problem with the text length.

If I have HTML like this:

Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>

... my RichTextArea getText().length() returns 42, which is the correct length. But when I extract the text from this HTML with code like this:

    Source source = new Source(html);
    String text = source.getTextExtractor().toString();

... text.length() returns 44.

So I don't understand why text of length 42 turns into text of length 44. How can I fix it?

Thanks


Solution

  • I dug deeper, and I think the wrong text length comes from the HTML line breaks: the Jericho HTML parser apparently replaces <br> tags with spaces. The two extra characters are consistent with each <br><br> pair ending up as a single space between the three lines of text.

    For now I can't say for sure which other tags it replaces, or with which characters, but in my case a workaround was to strip the <br> tags with a regular expression before parsing:

    // Strip <br> tags first so the extractor cannot turn them into spaces.
    html = html.replaceAll("<br>", "");

    Source source = new Source(html);
    String text = source.getTextExtractor().toString();
    

    ... so now it returns the original text length of 42 :)

    I hope this tip saves someone's day.


    Thank you all for your help.
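
For anyone who wants to check the numbers without Jericho on the classpath, here is a minimal sketch in plain Java. It does not call the library. It only imitates the behaviour guessed at above, under the assumption that each <br> becomes whitespace, runs of whitespace collapse to one space, and the ends are trimmed. The class name `BrLengthDemo`, both helper methods, and the `BR` pattern are made up for this illustration. The pattern also covers `<BR>`, `<br/>` and `<br />`, which the plain `"<br>"` replacement would miss, but it does not handle tags with attributes.

```java
import java.util.regex.Pattern;

public class BrLengthDemo {
    // Hypothetical pattern: matches <br>, <BR>, <br/> and <br /> (no attributes).
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");

    // Assumed extractor behaviour: <br> -> space, collapse whitespace, trim.
    public static String brAsSpace(String html) {
        return BR.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // The workaround: drop the <br> tags entirely before extracting text.
    public static String brRemoved(String html) {
        return BR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>";
        System.out.println(brAsSpace(html).length()); // prints 44
        System.out.println(brRemoved(html).length()); // prints 42
    }
}
```

Each line of text is 14 characters, so 3 x 14 = 42 with the tags removed, and 42 plus the two separating spaces gives 44. Note that stripping the tags also glues the three lines together with no separator, which may or may not be what you want.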