Tags: html, parsing, text-parsing, html-content-extraction

Create Great Parser - Extract Relevant Text From HTML/Blogs


I'm trying to create a generalized HTML parser that works well on blog posts. I want to point my parser at a specific entry's URL and get back the clean text of the post itself. My basic approach (in Python) has been to use a combination of BeautifulSoup and urllib2, which works okay, but it assumes you know the proper tags for the blog entry. Does anyone have any better ideas?

Here are some thoughts that maybe someone could expand upon; I don't have enough know-how yet to implement them.

  1. The Unix program 'lynx' seems to parse blog posts especially well. What parser does it use, or how could it be utilized?

  2. Are there any services/parsers that automatically remove junk ads, etc?

  3. I had a vague notion that it may be an okay assumption that blog posts are usually contained in a certain defining tag with class="entry" or something similar. Thus, it may be possible to create an algorithm that finds the enclosing tags with the most clean text between them. Any ideas on this?
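To make idea 3 concrete, here is a minimal sketch of the "enclosing tag with the most clean text" heuristic, using only Python's standard library. The names `DensestBlockFinder`, `densest_block_text`, and the tag sets are hypothetical choices for illustration. It rests on one assumption that may not hold for every blog: each run of text is credited to its nearest enclosing container tag, so an outer wrapper does not win just because it contains everything.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is never post content.
SKIP_TAGS = {"script", "style", "noscript"}
# Assumption: the post lives in one of these container tags.
CANDIDATE_TAGS = {"div", "article", "section", "td", "body"}
# Void elements have no closing tag, so they never go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DensestBlockFinder(HTMLParser):
    """Tracks which candidate container directly holds the most text."""

    def __init__(self):
        super().__init__()
        # Open elements as (tag, chunks); chunks is None for non-candidates.
        self.stack = []
        self.skip_depth = 0
        self.best_text = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag not in VOID_TAGS:
            self.stack.append((tag, [] if tag in CANDIDATE_TAGS else None))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if all(open_tag != tag for open_tag, _ in self.stack):
            return  # stray closing tag: ignore it
        # Pop until the matching open tag, closing any unclosed children.
        while self.stack:
            open_tag, chunks = self.stack.pop()
            if chunks is not None:
                self._consider(chunks)
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or self.skip_depth:
            return
        # Credit the text to the nearest enclosing candidate container.
        for _, chunks in reversed(self.stack):
            if chunks is not None:
                chunks.append(text)
                break

    def close(self):
        super().close()
        # Evaluate containers that were never explicitly closed.
        while self.stack:
            _, chunks = self.stack.pop()
            if chunks is not None:
                self._consider(chunks)

    def _consider(self, chunks):
        text = " ".join(chunks)
        if len(text) > len(self.best_text):
            self.best_text = text


def densest_block_text(html):
    finder = DensestBlockFinder()
    finder.feed(html)
    finder.close()
    return finder.best_text


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div class="entry">
  <p>First paragraph of the post, with enough words to dominate.</p>
  <p>Second paragraph continues the story.</p>
</div>
<div id="footer">Copyright notice</div>
</body></html>
"""

print(densest_block_text(SAMPLE))
```

Because chunks are joined with single spaces, inline markup such as `<b>` can leave a stray space before punctuation. The heuristic also has no notion of link-heavy blocks, so a long sidebar can still beat a short post.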

Thanks!


Solution

  • Boy, do I have the perfect solution for you.

    Arc90's readability algorithm does exactly this. Given HTML content, it picks out the content of the main blog post text, ignoring headers, footers, navigation, etc.

    Here are implementations in:

    I said I would release a Perl port to CPAN in a couple of days; that is now done.

    Hope this helps!
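For a feel of the kind of scoring a Readability-style extractor performs, here is a heavily simplified sketch using only Python's standard library. It is not Arc90's code or a port of it. The class `ReadabilitySketch`, the function `extract_main_text`, the keyword lists, and the exact weights are all assumptions chosen for illustration. The general ideas it borrows are: paragraphs earn points for length and commas, those points are credited to the enclosing elements, class/id names nudge the score up or down, and link-heavy blocks are discounted.

```python
import re
from html.parser import HTMLParser

# Hypothetical hint words; real implementations use longer lists.
POSITIVE = re.compile(r"article|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|nav|sidebar|sponsor|promo|widget", re.I)
SKIP_TAGS = {"script", "style", "noscript"}
CONTAINERS = {"div", "article", "section", "td"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs):
        self.tag = tag
        hints = " ".join(v or "" for k, v in attrs if k in ("class", "id"))
        self.weight = 0
        if POSITIVE.search(hints):
            self.weight += 25
        if NEGATIVE.search(hints):
            self.weight -= 25
        self.score = 0.0      # points credited by descendant paragraphs
        self.chunks = []      # text of this element and closed descendants
        self.link_chars = 0   # characters of that text found inside <a>


class ReadabilitySketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = [Node("#root", [])]  # the root is never popped
        self.skip_depth = 0
        self.best = (0.0, "")             # (score, text) of the best container

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag not in VOID_TAGS:
            self.stack.append(Node(tag, attrs))

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.stack[-1].chunks.append(text)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif any(node.tag == tag for node in self.stack[1:]):
            while self._pop().tag != tag:
                pass  # also closes unclosed children on the way

    def close(self):
        super().close()
        while len(self.stack) > 1:
            self._pop()

    def _pop(self):
        node = self.stack.pop()
        parent = self.stack[-1]
        text = " ".join(node.chunks)
        if node.tag == "a":
            node.link_chars = len(text)
        elif node.tag == "p" and len(text) >= 25:
            points = 1 + text.count(",") + min(len(text) // 100, 3)
            parent.score += points                  # full credit to parent
            if len(self.stack) > 1:
                self.stack[-2].score += points / 2  # half to grandparent
        elif node.tag in CONTAINERS and text:
            link_density = node.link_chars / len(text)
            final = (node.score + node.weight) * (1 - link_density)
            if final > self.best[0]:
                self.best = (final, text)
        # Roll text and link counts up so ancestors see them too.
        parent.chunks.extend(node.chunks)
        parent.link_chars += node.link_chars
        return node


def extract_main_text(html):
    parser = ReadabilitySketch()
    parser.feed(html)
    parser.close()
    return parser.best[1]


PAGE = """
<html><body>
<div id="sidebar">
  <p><a href="/a">A long list of links, archives, tags, and other navigation items</a></p>
  <p><a href="/b">Another long promotional link, with commas, to inflate the score</a></p>
</div>
<div class="entry">
  <p>The post itself has real sentences, some commas, and no links at all.</p>
  <p>A second paragraph keeps going long enough to count as content.</p>
</div>
</body></html>
"""

print(extract_main_text(PAGE))
```

In this sample the sidebar earns more raw paragraph points than the post, but its link density is close to 1, so its final score collapses and the `entry` block wins. Real pages are messier (unclosed `<p>` tags, nested wrappers, comments that look like content), so treat this as a starting point rather than a replacement for a maintained implementation.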