python web-scraping html-parsing html

Web scraping - how to identify main content on a webpage


Given a news article webpage (from any major news source, such as the Times or Bloomberg), I want to identify the main article content on that page and discard the other miscellaneous elements, such as ads, menus, sidebars, and user comments.

What's a generic way of doing this that will work on most major news sites?

What are some good tools or libraries for data mining (preferably Python-based)?


Solution

  • No approach is guaranteed to work on every site, but one strategy is to look for the element that contains the most visible text.
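
Here is a minimal sketch of that strategy using only the standard library's `html.parser`. It relies on a few assumptions of mine. Taken literally, the element with the most visible text is always `<body>`, so this version credits each piece of text to its nearest enclosing container instead. The `CONTAINERS` and `INVISIBLE` tag sets are my own guesses at reasonable defaults, and `MainContentFinder` and `find_main_content` are hypothetical names. It also assumes container tags are properly nested, which real pages do not always honor.

```python
from html.parser import HTMLParser

# Tags whose text is never shown to the reader.
INVISIBLE = {"script", "style", "noscript", "template", "head"}
# Assumption: only these tags are treated as candidate content containers.
CONTAINERS = {"article", "main", "section", "div", "td", "body"}


class MainContentFinder(HTMLParser):
    """Credits each visible text node to its nearest enclosing container."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # one (tag, attrs, text_chunks) tuple per container
        self.open_blocks = []  # stack of indices into self.blocks
        self.hidden_depth = 0  # > 0 while inside an invisible element

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE:
            self.hidden_depth += 1
        elif tag in CONTAINERS:
            self.blocks.append((tag, dict(attrs), []))
            self.open_blocks.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in INVISIBLE:
            self.hidden_depth = max(0, self.hidden_depth - 1)
        elif tag in CONTAINERS and self.open_blocks:
            # Assumes well-nested containers; malformed HTML may misalign this.
            self.open_blocks.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.hidden_depth and self.open_blocks:
            self.blocks[self.open_blocks[-1]][2].append(text)


def find_main_content(html):
    """Return (tag, attrs, text) for the container holding the most text, or None."""
    finder = MainContentFinder()
    finder.feed(html)
    finder.close()
    if not finder.blocks:
        return None
    tag, attrs, chunks = max(
        finder.blocks, key=lambda block: sum(len(chunk) for chunk in block[2])
    )
    if not chunks:
        return None
    return tag, attrs, " ".join(chunks)
```

Calling `find_main_content(page_source)` returns the winning container's tag name, its attributes as a dict, and its text. Scoring by raw character count is crude: a long comment thread or footer can outscore a short article, so you may want to weight paragraph text more heavily or penalize link-heavy blocks.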