I have a bunch of web documents and want to remove the HTML tags from them. I saw some posts on StackOverflow on how to do this in Java, covering everything from regex to HtmlCleaner and Jsoup.
I am interested in finding the fastest way to do it. I have millions of documents, so performance is crucial in my case. I can even trade a bit of quality for performance.
Thanks for any answers in advance.
It seems that a Java regex is the fastest solution. However, it degrades the quality of the extracted text.
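For reference, here is a minimal sketch of what the regex approach could look like. The class name `TagStripper` and the pattern `<[^>]*>` are my own choices for illustration, not something taken from the posts mentioned above, and I have not benchmarked it against HtmlCleaner or Jsoup. The patterns are compiled once and reused, which matters when processing millions of documents.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a full HTML parser.
public final class TagStripper {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private TagStripper() {}

    // Replaces every tag with a space, then collapses runs of whitespace.
    public static String strip(String html) {
        String noTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p>"));
    }
}
```

This also shows where the quality loss comes from: entities such as `&amp;` are left undecoded, the contents of `<script>` and `<style>` elements stay in the output, and a `>` inside an attribute value or comment can cut a tag short.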