
Extract field list from reStructuredText


Say I have the following reST input:

Some text ...

:foo: bar

Some text ...

What I would like to end up with is a dict like this:

{"foo": "bar"}

I tried to use this:

tree = docutils.core.publish_parts(text)

It does parse the field list, but I end up with some pseudo-XML in tree["whole"]:

<document source="<string>">
    <docinfo>
        <field>
            <field_name>
                foo
            <field_body>
                <paragraph>
                    bar

Since the tree dict contains no other useful information and tree["whole"] is just a string, I am not sure how to extract the field list from the reST document. How would I do that?


Solution

  • You could try something like the following code. Rather than publish_parts, I have used publish_doctree, which returns the parsed document tree itself instead of a rendered string. I then convert that tree to an XML DOM in order to extract all the field elements, and take the first field_name and field_body element of each field.

    from docutils.core import publish_doctree
    
    source = """Some text ...
    
    :foo: bar
    
    Some text ...
    """
    
    # Parse reStructuredText input, returning the Docutils doctree as
    # an `xml.dom.minidom.Document` instance.
    doctree = publish_doctree(source).asdom()
    
    # Get all field elements in the document.
    fields = doctree.getElementsByTagName('field')
    
    d = {}
    
    for field in fields:
        # I am assuming that each field contains exactly one `field_name`
        # and one `field_body`, so taking the first match is enough.
        field_name = field.getElementsByTagName('field_name')[0]
        field_body = field.getElementsByTagName('field_body')[0]
    
        d[field_name.firstChild.nodeValue] = \
            " ".join(c.firstChild.nodeValue for c in field_body.childNodes)
    
    print(d)  # Prints {'foo': 'bar'}
    

    The xml.dom module isn't the easiest to work with (for example, why do I need .firstChild.nodeValue rather than just .nodeValue?), so you may wish to use the xml.etree.ElementTree module, which I find a lot easier to work with. If you use lxml, you can also use XPath notation to find all of the field, field_name and field_body elements.
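
As a rough sketch of the xml.etree.ElementTree route mentioned above: the helper names doctree_xml and extract_fields are made up for illustration, and SAMPLE_XML is a hand-written approximation of the XML Docutils produces for the input in the question, not captured output. The sketch assumes the element names field, field_name and field_body shown in the pseudo-XML above.

```python
import xml.etree.ElementTree as ET


def doctree_xml(rst_source):
    """Return the Docutils doctree for `rst_source` as an XML string."""
    # Imported lazily so the parsing helper below works without Docutils.
    from docutils.core import publish_doctree
    return publish_doctree(rst_source).asdom().toxml()


def extract_fields(xml_text):
    """Map each field name to the text of its body."""
    root = ET.fromstring(xml_text)
    fields = {}
    # iter() walks the whole tree, so field lists at any depth are found.
    for field in root.iter("field"):
        name = field.find("field_name")
        body = field.find("field_body")
        if name is None or body is None:
            continue  # skip fields that lack a name or a body
        key = "".join(name.itertext()).strip()
        # Join the text of each body child (usually paragraphs) with spaces.
        fields[key] = " ".join(
            "".join(child.itertext()).strip() for child in body
        )
    return fields


# Hand-written sample (an assumption) approximating the doctree XML
# for the reST input in the question.
SAMPLE_XML = (
    '<document source="&lt;string&gt;">'
    "<paragraph>Some text ...</paragraph>"
    "<field_list><field>"
    "<field_name>foo</field_name>"
    "<field_body><paragraph>bar</paragraph></field_body>"
    "</field></field_list>"
    "<paragraph>Some text ...</paragraph>"
    "</document>"
)

print(extract_fields(SAMPLE_XML))  # Prints {'foo': 'bar'}
```

With Docutils installed, extract_fields(doctree_xml(source)) should give the same dict for the reST input from the question. Using itertext() rather than .text means inline markup inside a field body (emphasis, literals and so on) is flattened into plain text instead of being dropped.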