Today I tried to use the jericho-html-3.2 library to extract text from simple HTML, and I ran into a strange text length problem:
if I have HTML like this
Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>
...my RichTextArea getText().length()
returns 42, which is the correct length. But when I try to extract the text from this HTML with code like this
Source source = new Source(html);
String text = source.getTextExtractor().toString();
... text.length()
returns 44.
So I don't understand why text of length 42 turns into text of length 44. How can I fix it?
Thanks
I dug a bit deeper, and I think the wrong text length comes from the HTML line breaks: the Jericho HTML parser apparently replaces the <br> tags with spaces or something similar. That would account for the difference, since each of the two <br><br> pairs sitting between the three 14-character lines would leave one space behind (42 + 2 = 44).
For now I cannot say for sure which other tags it replaces, or with which characters, but in my case I tried a workaround using a regular expression like this:
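To check the arithmetic without the library, here is a minimal sketch that only simulates the effect I observed, using plain String methods. It is not Jericho's implementation; the class BrLengthDemo and the helper brToSpace are names I made up for illustration, and the assumption is that each line break becomes whitespace, runs of whitespace collapse to one space, and the ends are trimmed.

```java
public class BrLengthDemo {
    // Simulates the observed effect only; this is NOT how Jericho is implemented.
    static String brToSpace(String html) {
        return html.replaceAll("(?i)<br\\s*/?>", " ") // assumption: each <br> becomes one space
                   .replaceAll("\\s+", " ")            // assumption: whitespace runs collapse to one space
                   .trim();                            // assumption: leading/trailing whitespace is dropped
    }

    public static void main(String[] args) {
        String html = "Hello World :)<br><br>Hello World :(<br><br>Hello World ;)<br>";
        String text = brToSpace(html);
        System.out.println(text);          // Hello World :) Hello World :( Hello World ;)
        System.out.println(text.length()); // 44
    }
}
```

Under those assumptions the result is 3 x 14 characters plus 2 separating spaces, which matches the 44 I got from the extractor.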
html=html.replaceAll("<br>","");
Source source = new Source(html);
String text = source.getTextExtractor().toString();
... and now it really returns the original text length of 42 :)
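A slightly more tolerant version of the same workaround, in case the HTML contains <BR>, <br/> or <br /> instead of a plain <br>. This is only a sketch with the standard java.util.regex classes; BrStripDemo and stripBr are hypothetical names of mine, not part of Jericho.

```java
import java.util.regex.Pattern;

public class BrStripDemo {
    // Matches <br>, <BR>, <br/> and <br /> (hypothetical helper, not a Jericho API).
    private static final Pattern BR =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);

    // Removes every line-break tag so nothing is left to be turned into a space.
    static String stripBr(String html) {
        return BR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "Hello World :)<br/><BR>Hello World :(<br /><br>Hello World ;)<br>";
        String cleaned = stripBr(html);
        System.out.println(cleaned.length()); // 42, since this input has no other markup
        // cleaned can then be passed to new Source(cleaned) as in the snippet above
    }
}
```

Keep in mind that removing the tags glues the lines together ("Hello World :)Hello World :(..."), so this only makes sense when you need the character count and not readable text.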
I hope this tip saves someone some time one day.
Thank you all for the help.