html screen-scraping text-extraction html-content-extraction

Scraping largest block of text from HTML document

I am working on an algorithm that will try to pick out, given an HTML file, what it thinks is the parent element that most likely contains the majority of the page's content text. For example, it would pick the div "content" in the following HTML:

<html>
   <body>
      <div id="header">This is the header we don't care about</div>
      <div id="content">This is the <b>Main Page</b> content.  it is the
      longest block of text in this document and should be chosen as
      most likely being the important page content.</div>
   </body>
</html>

I have come up with a few ideas, such as traversing the HTML document tree down to its leaves, adding up the length of the text in each, and only considering a parent's other text if the parent gives us more content than its children do.

Has anyone tried something like this, or does anyone know of an algorithm that can be applied? It doesn't have to be perfect; as long as it can guess a container that holds most of the page's content text (for articles or blog posts, for example), that would be awesome.

Solution

You could create an app that looks for contiguous blocks of text, disregarding formatting tags (if required). You could do this by using a DOM parser and walking the tree, keeping track of the immediate parent (because that is your output).

Start from the parent nodes and traverse the tree. For each node that is just formatting, continue the 'count' within that sub-block, counting the characters of the content.

Once you find the block with the most content, traverse back up the tree to its parent to get your answer.
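As a rough illustration of the steps above, here is a minimal sketch in Python using only the standard library's html.parser. It is one possible reading of the idea, not a tested extractor: the names LargestTextBlockFinder and find_main_block are made up for this example, and the sets of tags treated as "formatting", "void", or "skipped" are assumptions you would tune for your own pages. Text inside formatting tags is credited to the nearest enclosing non-formatting element, and the element with the highest character count wins.

```python
from html.parser import HTMLParser

# Assumption: tags treated as pure formatting; their text counts toward the parent.
INLINE_TAGS = {"a", "b", "i", "u", "em", "strong", "span", "small", "code", "font"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "area", "base",
             "col", "embed", "source", "track", "wbr"}
# Text inside these elements is not visible page content.
SKIP_TAGS = {"script", "style"}


class LargestTextBlockFinder(HTMLParser):
    """Hypothetical helper: tracks which block element directly owns the most text."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open block elements as [tag, attrs, char_count]
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.best = None      # (tag, attrs, char_count) of the current winner

    def _record(self, entry):
        tag, attrs, count = entry
        if self.best is None or count > self.best[2]:
            self.best = (tag, attrs, count)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag not in INLINE_TAGS and tag not in VOID_TAGS:
            self.stack.append([tag, dict(attrs), 0])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if tag in INLINE_TAGS or tag in VOID_TAGS:
            return
        # Ignore stray closing tags that match nothing currently open.
        if not any(entry[0] == tag for entry in self.stack):
            return
        # Pop back to the matching element, tolerating unclosed children.
        while self.stack:
            entry = self.stack.pop()
            self._record(entry)
            if entry[0] == tag:
                break

    def handle_data(self, data):
        if self.skip_depth == 0 and self.stack:
            # Collapse runs of whitespace so indentation does not inflate the count.
            self.stack[-1][2] += len(" ".join(data.split()))


def find_main_block(html):
    """Return (tag, attrs, char_count) for the best candidate, or None if no elements."""
    finder = LargestTextBlockFinder()
    finder.feed(html)
    finder.close()
    # Score any elements the document never closed.
    while finder.stack:
        finder._record(finder.stack.pop())
    return finder.best
```

Calling find_main_block on the sample document from the question should pick the div whose id is "content", since the bold "Main Page" text is counted as part of that div. A real page would likely need more than raw character counts (for example, penalising link-heavy navigation blocks), so treat this as a starting point only.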

I think your solution relies on how you traverse the DOM and keep track of the nodes that you are scanning.

What language are you using? Any other details for your project? There may be language specific or package specific tools you could use as well.