algorithm, text, web-scraping, text-extraction

Body text extraction from websites, e.g. extract only the article heading and text, not all the text on the site


I am looking for algorithms that extract text from websites. I do not mean "strip HTML", or any of the hundreds of libraries that do that.

For example, for a news article I would like to identify the heading and all the article text, but not the comments section and so on.

Are there any algorithms for that out there? Thank you!


Solution

  • In the computer science literature this problem is usually referred to as page segmentation or boilerplate detection. See the paper "Boilerplate Detection using Shallow Text Features" (Kohlschütter, Fankhauser, and Nejdl, WSDM 2010). I also have a few reports and software sites bookmarked that address the problem. Also, see this Stack Overflow question.
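
To give a feel for what "shallow text features" means in practice, here is a minimal sketch in Python using only the standard library. It is not the classifier from the paper: it is a crude heuristic loosely in that spirit, which splits the page into text blocks at block-level tags, counts the words and the linked words in each block, and keeps blocks that are long and have few links. The names `BlockSplitter`, `extract_main_text`, and `SAMPLE_PAGE`, as well as the thresholds `min_words=10` and `max_link_density=0.33`, are assumptions made up for this illustration and would need tuning on real pages.

```python
from collections import namedtuple
from html.parser import HTMLParser

# Hypothetical tag sets for this sketch; real pages may need more entries.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "ul", "ol", "td", "tr",
              "table", "section", "article", "header", "footer", "nav",
              "aside", "blockquote", "br", "body", "title"}
HEADING_TAGS = {"h1", "h2", "h3"}
SKIP_TAGS = {"script", "style", "noscript"}

Block = namedtuple("Block", "text words linked_words tag")


class BlockSplitter(HTMLParser):
    """Split a page into text blocks and count words / linked words in each."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._pieces, self._linked, self._tag = [], 0, None
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Normalise whitespace; ignore blocks that contain no visible text.
        text = " ".join("".join(self._pieces).split())
        if text:
            self.blocks.append(
                Block(text, len(text.split()), self._linked, self._tag))
        self._pieces, self._linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()
            self._tag = None

    def handle_data(self, data):
        if self._skip:
            return
        self._pieces.append(data)
        if self._in_link:
            self._linked += len(data.split())

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.33):
    """Return the text blocks that look like article content (heuristic)."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    blocks = parser.blocks
    # A block counts as content if it is long enough and mostly unlinked.
    is_content = [
        b.words >= min_words and b.linked_words / b.words <= max_link_density
        for b in blocks
    ]
    kept = []
    for i, b in enumerate(blocks):
        # Headings are short, so keep one only if content follows directly.
        heading_before_content = (
            b.tag in HEADING_TAGS and i + 1 < len(blocks) and is_content[i + 1]
        )
        if is_content[i] or heading_before_content:
            kept.append(b.text)
    return kept


SAMPLE_PAGE = """
<html><head><title>Site</title><script>var x = 1;</script></head><body>
<nav><a href="/">Home</a> <a href="/news">News</a> <a href="/sport">Sport</a></nav>
<h1>Council approves new bridge</h1>
<p>The city council voted on Tuesday to approve funding for a new bridge
across the river, ending years of debate.</p>
<p>Construction is expected to begin next spring and take roughly two years,
according to the <a href="/plan">published plan</a>.</p>
<div class="comments"><p>Nice!</p><p>First comment</p></div>
<footer><a href="/about">About us</a> <a href="/contact">Contact</a></footer>
</body></html>
"""

for block in extract_main_text(SAMPLE_PAGE):
    print(block)
```

On the made-up sample page this keeps the heading and the two article paragraphs, and drops the navigation, the short comments, and the footer. A long comment with few links would slip through, which is one reason the published approaches also look at neighbouring blocks and learn their thresholds from labelled data rather than hard-coding them.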