java, html-parsing, jsoup, html, htmlcleaner

What is the fastest way to remove HTML tags from a document in Java?


I have a bunch of web documents and want to remove the HTML tags from them. I have seen several posts on Stack Overflow about how to do this in Java, with approaches ranging from regex to HtmlCleaner and Jsoup.

I am interested in finding the fastest way to do it. I have millions of documents, so performance is crucial in my case. I am even willing to trade a bit of output quality for speed.

Thanks for any answers in advance.


Solution

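  • Java regex seems to be the fastest solution. However, it degrades the quality of the extracted text: a simple tag-stripping regex removes markup without parsing the document, so script and style contents stay in the output and HTML entities are not decoded.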
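
For illustration, here is a minimal sketch of the regex approach using only `java.util.regex`. The class name `RegexTagStripper` and both patterns are assumptions made for this example, not something taken from the answer, and no benchmark is implied. The patterns are compiled once and reused, which matters when processing millions of documents.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the name and patterns are assumptions.
final class RegexTagStripper {
    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private RegexTagStripper() {}

    // Replaces every "<...>" run with a space, then collapses whitespace.
    // This does not parse HTML: script/style bodies are kept and
    // entities such as &amp; are left undecoded.
    static String strip(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }
}
```

Calling `RegexTagStripper.strip("<p>Hello <b>world</b></p>")` returns `Hello world`. The quality loss shows up on input such as `<script>var x = 1;</script>Hi &amp; bye`, which comes back as `var x = 1; Hi &amp; bye`. If that loss is not acceptable, a parser such as Jsoup is the slower but cleaner option.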