
Is there a way to use readability and python to extract just text, not HTML?


I need to extract pure text from an arbitrary web page at runtime, on the server side. I use Google App Engine and a Python port of Readability. There are several such ports:

  1. early version by gfxmonk, based on BeautifulSoup
  2. version by minvolai, based on gfxmonk's but using lxml instead of BeautifulSoup, which makes it faster (according to minvolai, see the project page) at the cost of a dependency on lxml.
  3. version by Yuri Baburov aka buriy. Like minvolai's, it depends on lxml. It also depends on chardet to detect encoding.

I use Yuri's version, as it is the most recent and seems to be in active development. I managed to make it run on Google App Engine using Python 2.7. Now the "problem" is that it returns HTML, whereas I need pure text.

The advice in this Stack Overflow question about link extraction is to use BeautifulSoup. I will, if there is no other choice, but BeautifulSoup would be yet another dependency, since I use the lxml-based version.

My questions:

  • Is there a way to get pure text from the Python Readability version that I use, without forking the code?
  • Is there a way to easily retrieve pure text from the HTML result of Python Readability, e.g. by using lxml, BeautifulSoup, a regex, or something else?
  • If the answer to the above is no, or yes but not easily, how should Python Readability be modified? Is such a modification desirable to enough people to make the extension official?

Solution

  • So as not to leave this open, here is my current solution:

    1. I did not find a way to get pure text from the Readability ports directly.
    2. I decided to use Beautiful Soup, version 4.
    3. Beautiful Soup has one simple function to extract text, get_text().

    code:

    from bs4 import BeautifulSoup

    # html is the HTML string returned by Readability
    soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
    text = soup.get_text()
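
    If adding Beautiful Soup as a dependency is a concern, the same idea can probably be sketched with the standard library alone. The following is a minimal sketch, not part of any Readability port: it assumes Python 3 (the `html.parser` module), and the names `TextExtractor`, `html_to_text`, `SKIP_TAGS` and `BLOCK_TAGS` are made up for illustration. It collects text nodes, skips the contents of script and style elements, and inserts a space around a small, incomplete set of block-level tags so that adjacent paragraphs do not run together.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes from an HTML string."""

    SKIP_TAGS = {"script", "style"}          # contents are dropped entirely
    BLOCK_TAGS = {"p", "div", "br", "li", "tr",
                  "h1", "h2", "h3", "h4", "h5", "h6"}  # illustrative subset

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append(" ")  # keep block boundaries apart

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        # Collapse every run of whitespace into a single space.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML fragment as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

    Usage would look like `text = html_to_text(summary_html)`, where `summary_html` stands for whatever HTML string Readability returned. Unlike `soup.get_text()`, this sketch normalizes all whitespace, so line breaks in the source are lost.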