I am looking for algorithms that allow text extraction from websites. I do not mean "strip html", or any of the hundreds of libraries that allow this.
So for example for a news article I would like to identify the heading and all the text, but not the comments section and so on.
Are there any algorithms for that out there? Thank you!
In computer science literature this problem is usually referred to as the page segmentation or boiler plate detection problem. See the report Boilerplate Detection using Shallow Text Features. Also, I have a few reports and software sites bookmarked that address the problem. Also, see this stackoverflow question.