Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments.

What's a generic way of doing this that will work on most major news sites?

What are some good tools or libraries for data mining? (preferably python based)

Solution

There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.