Remove boilerplate content from HTML page


I would like to use jusText (https://github.com/miso-belica/jusText) to extract the clean content from an HTML page. Basically it works like this:

import requests
import justext

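# Fetch the page and classify each paragraph as content or boilerplate
response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    # Print only the paragraphs jusText considers real content
    if not paragraph.is_boilerplate:
        print(paragraph.text)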

I have already downloaded the pages I want to parse (some of them are no longer available online) and extracted their HTML content. jusText appears to work only on the output of a request (a response object), so I am wondering whether there is a way to set the content of a response object to the HTML text I want to parse.


Solution

  • response.content is not a special response type; it is a plain string of bytes (<type 'str'> on Python 2, bytes on Python 3):

    >>> from requests import get
    >>> r = get("http://www.google.com/")
    >>> type(r.content)
    <type 'str'>
    

    So there is no need for a response object at all. Pass your own HTML string straight to jusText:

    justext.justext(my_html_string, justext.get_stoplist("English"))
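For pages that are already saved on disk, the same idea could be sketched roughly as follows. The helper names load_html and extract_text and the file name saved_page.html are made up for illustration; only justext.justext, justext.get_stoplist, is_boilerplate and text come from the question. Reading the file as raw bytes is an assumption that mirrors what response.content provides.

```python
from pathlib import Path


def load_html(path):
    """Read a saved page as raw bytes, the same kind of data as response.content."""
    return Path(path).read_bytes()


def extract_text(html, language="English"):
    """Return the text of every paragraph jusText does not flag as boilerplate."""
    # Third-party import kept local so load_html works without jusText installed
    import justext

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return [p.text for p in paragraphs if not p.is_boilerplate]


# Hypothetical usage with a page saved earlier:
#     for text in extract_text(load_html("saved_page.html")):
#         print(text)
```

If the HTML is already held in a variable, load_html can be skipped and the string passed to extract_text directly.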