java gwt html-parsing jericho-html-parser

jericho-html - text extraction and incorrect text length

Today I tried to use the jericho-html-3.2 library to extract text from some simple HTML, and I ran into a strange problem with the length of the extracted text.

If I have HTML like this:

    Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>

...my RichTextArea getText().length() returns 42, which is the correct length. But when I try to extract the text from this HTML with code like this:

    Source source = new Source(html);
    String text = source.getTextExtractor().toString();

... text.length() returns 44.

So I don't understand why text of length 42 turns into text of length 44. How can I fix it?

Thanks

Solution

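I dug deeper, and I think the wrong text length comes from the HTML line breaks: for some reason the jericho HTML parser seems to replace the <br> tags with spaces. With the HTML above, that would leave one space between each pair of "Hello World" lines (consecutive whitespace apparently being collapsed and the trailing one dropped), i.e. two extra characters, which matches 44 instead of 42.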
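To illustrate where the two extra characters could come from, here is a minimal sketch in plain Java. It does not call Jericho at all; it only imitates the behaviour guessed at above (each <br> becomes whitespace, runs of whitespace collapse to one space, leading and trailing whitespace is dropped). The class and method names are hypothetical.

```java
// Hypothetical demo class: it imitates the observed behaviour with plain
// string operations and is NOT Jericho's actual implementation.
class BrLengthDemo {

    // Assumption: every <br> is turned into whitespace, consecutive
    // whitespace is collapsed to a single space, and the result is trimmed.
    static String approximateExtractor(String html) {
        return html.replaceAll("<br>", " ")   // tag -> space
                   .replaceAll("\\s+", " ")   // collapse whitespace runs
                   .trim();                   // drop the trailing space
    }

    public static void main(String[] args) {
        String html = "Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>";
        String text = approximateExtractor(html);
        System.out.println(text);          // Hello World :) Hello World :( Hello World ;)
        System.out.println(text.length()); // 44 = 3 * 14 characters + 2 separating spaces
    }
}
```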

For now I can't say for sure which other tags it replaces, or with which characters, but in my case I used a workaround with a regular expression like this:

    html = html.replaceAll("<br>", "");

    Source source = new Source(html);
    String text = source.getTextExtractor().toString();

... and now it really returns the original text length, 42 :)

I hope this tip saves someone's day.

Thank you all for your help.
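One caveat: replaceAll("<br>", "") only matches that exact spelling. If the markup might also contain <br/>, <br /> or <BR>, a slightly broader pattern could be used before handing the string to new Source(...). This is only a sketch in plain Java; the class and method names are hypothetical, and it still would not match <br> tags that carry attributes.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jericho or GWT.
class BrStripper {

    // (?i)  case-insensitive, so <BR> matches too
    // \s*   optional whitespace before the slash
    // /?    optional self-closing slash
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");

    // Removes plain line-break tags; tags with attributes are left alone.
    static String stripBreaks(String html) {
        return BR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "Hello World :)<br><BR/>Hello World :(<br /><br>Hello World ;)<br>";
        System.out.println(stripBreaks(html).length()); // 42
    }
}
```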