Tags: python, html, restructuredtext, docutils

How do I convert a docutils document tree into an HTML string?


I'm trying to use the docutils package to convert ReST to HTML. This answer succinctly uses the docutils publish_* convenience functions to do the conversion in one step. However, the ReST documents that I want to convert have multiple sections that I want to separate in the resulting HTML, so I want to break the process down into three steps:

  1. Parse the ReST into a tree of nodes.
  2. Separate the nodes as appropriate.
  3. Convert the nodes I want into HTML.

It's step three that I'm struggling with. Here's how I do steps one and two:

from docutils import utils
from docutils.frontend import OptionParser
from docutils.parsers.rst import Parser

# preamble
rst = '*NB:* just an example.'   # will actually have many sections
path = 'some.url.com'
settings = OptionParser(components=(Parser,)).get_default_values()

# step 1
document = utils.new_document(path, settings)
Parser().parse(rst, document)

# step 2
for node in document:
    do_something_with(node)  # placeholder for my own processing

# step 3: Help!
for node in filtered(document):  # filtered() is a placeholder too
    print(convert_to_html(node))

I've found the HTMLTranslator class and the Publisher class. They seem relevant but I'm struggling to find good documentation. How should I implement the convert_to_html function?


Solution

  • My problem was that I was trying to use the docutils package at too low a level. It provides a higher-level interface for this sort of thing:

    from docutils.core import publish_doctree, publish_from_doctree
    
    rst = '*NB:* just an example.'
    
    # step 1
    tree = publish_doctree(rst)
    
    # step 2
    # do something with the tree
    
    # step 3
    # publish_from_doctree returns encoded bytes here, hence the decode()
    html = publish_from_doctree(tree, writer_name='html').decode()
    print(html)
    

    Step one is now much simpler. That said, I'm still slightly dissatisfied with the result; I realise that what I really want is a publish_node function. If you know a better way please do post it.
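
For what it's worth, a publish_node-style helper can be approximated by wrapping a copy of the node in a fresh document and publishing that. This is only a sketch under assumptions: the helper name and the sample ReST are made up for illustration, each call returns a complete HTML page (head and stylesheet included) and not a bare fragment, and cross-references between the split-out pieces are not fixed up.

```python
from docutils import nodes, utils
from docutils.core import publish_doctree, publish_from_doctree

# Hypothetical sample input with two top-level sections.
RST = """\
First
=====

Alpha text.

Second
======

Beta text.
"""


def publish_node(node, settings):
    """Render one doctree node as a full HTML page (hypothetical helper)."""
    # A writer expects a whole document, so wrap a copy of the node in one.
    fragment = utils.new_document('<fragment>', settings)
    fragment.append(node.deepcopy())
    html = publish_from_doctree(fragment, writer_name='html')
    # Depending on the output settings the result may be bytes or str.
    return html.decode('utf-8') if isinstance(html, bytes) else html


tree = publish_doctree(RST)
# Keep only the top-level sections of the parsed document.
sections = [child for child in tree.children if isinstance(child, nodes.section)]
pages = [publish_node(section, tree.settings) for section in sections]
```

Each entry in `pages` is the HTML for one section only, so the sections can be handled separately afterwards.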

    I should also note that I haven't managed to get this working with Python 3.
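
On the bytes-versus-text side of this: the call to decode() can be avoided by asking docutils for text output directly. The sketch below assumes a reasonably recent docutils on Python 3 (versions are not stated in this answer, so treat that as an assumption) and uses the `output_encoding` setting with the special value `'unicode'`.

```python
from docutils.core import publish_doctree, publish_from_doctree

rst = '*NB:* just an example.'
tree = publish_doctree(rst)

# With the 'unicode' pseudo-encoding the writer output is returned as str,
# so no decode() call is needed afterwards.
html_text = publish_from_doctree(
    tree,
    writer_name='html',
    settings_overrides={'output_encoding': 'unicode'},
)
```

`settings_overrides` is accepted by the other publish_* convenience functions as well, so the same override should carry over to them.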

    The real lesson

    What I was actually trying to do was extract all of the sidebar elements from the doctree so that they could be handled separately from the main body of the article. This is not the sort of use case that docutils was designed to support, hence no publish_node function.

    Once I realised this, the correct approach was simple enough:

    1. Generate the HTML using docutils.
    2. Extract the sidebar elements using BeautifulSoup.

    Here's the code that got the job done:

    from docutils.core import publish_parts
    from bs4 import BeautifulSoup
    
    rst = get_rst_string_from_somewhere()
    
    # get just the body of an HTML document 
    html = publish_parts(rst, writer_name='html')['html_body']
    soup = BeautifulSoup(html, 'html.parser')
    
    # docutils wraps the body in a div with the .document class
    # we can just dispose of that div altogether
    wrapper = soup.select('.document')[0]
    wrapper.unwrap()
    
    # knowing that docutils gives all sidebar elements the
    # .sidebar class makes extracting those elements easy
    sidebar = ''.join(tag.extract().prettify() for tag in soup.select('.sidebar'))
    
    # leaving the non-sidebar elements as the document body
    body = soup.prettify()
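
To try the same idea end to end, here is a self-contained variant with a made-up ReST sample standing in for get_rst_string_from_somewhere(). The function name `split_sidebars` and the sample text are assumptions for illustration. It relies only on the .sidebar class and skips unwrapping the outer .document div.

```python
from bs4 import BeautifulSoup
from docutils.core import publish_parts

# Hypothetical sample input containing one sidebar.
RST = """\
Main text of the article.

.. sidebar:: Side note

   Text that belongs in the sidebar.

More main text.
"""


def split_sidebars(rst):
    """Return (sidebar_html, body_html) for a ReST string."""
    html = publish_parts(rst, writer_name='html')['html_body']
    soup = BeautifulSoup(html, 'html.parser')
    # Collect the matches first, then detach each one from the tree.
    sidebars = [tag.extract() for tag in soup.select('.sidebar')]
    sidebar_html = ''.join(tag.prettify() for tag in sidebars)
    # Whatever is left in the soup is the non-sidebar body.
    return sidebar_html, soup.prettify()


sidebar, body = split_sidebars(RST)
```

After this runs, `sidebar` holds the markup of the sidebar alone and `body` holds the rest of the article.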