Search code examples
pythonpython-sphinxrestructuredtextdocutils

docutils traverse sections and create list of subsections with their paths


This questions attempts to extend the code from Jason S (Thanks, Jason), found here: Docutils: traverse sections?.

Original from Jason S

import docutils

def doctree_resolved(app, doctree, docname):
    for section in doctree.traverse(docutils.nodes.section):
        title = section.next_node(docutils.nodes.Titular)
        if title:
            print title.astext()

def setup(app):
    app.connect('doctree-resolved', doctree_resolved)

Now, suppose I want to capture the text of only H2 subsections (or at least all subsections if that's the only option).

In theory, I'm trying to create a dictionary of subsection titles, with their respective urls/paths. What is the best way to do so?

My revision -to original above- is unsuccessful, but hopefully, you can understand my approach by reading the code. A simple for-loop using list-building is not successful, I believe, because of the line title = section.next_node(docutils.nodes.Titular), specifically, next_node.... I looked but cannot find documentation on next_node.

So, what might be a different way to achieve capturing the H2 subsections from each document so I can build a dictionary where each subsection has a path/url? NOTE: I have not yet attempted to construct the full URL in the code below.

import docutils

docname_list = []
section_list = []

def doctree_resolved(app, doctree, docname):
    for section in doctree.traverse(docutils.nodes.section):
        title = section.next_node(docutils.nodes.Titular)
        if title:
            print(title.astext())
            docname_list.append(docname)
            section_list.append(title.astext())
url_dict = dict(zip(docname_list, section_list)
...

Here's the "hacky" method I used (also in comment below), at first, which I knew wasn't really the best way to derive the "top title" of each document.

def run(self):     
    node_sec = nodes.section()     
    docs = node_sec.document     
    title_list = [ ] 
    
    for t in docs.traverse(docutils.nodes.title):
        if t.tagname == "title" and t is not None:             
            title_list.append(t.astext())
            break

Solution

  • The easiest way to get a list of all section titles in a document is

    from docutils import nodes
    
    def all_section_titles(doctree):
        return [section.next_node(nodes.title).astext()
                for section in doctree.findall(nodes.section)]
    

    Looking directly for nodes.title elements would also include the global document title as well as titles of topics, tables, and generic admonitions. For details on the used element methods, see their Python docstring help(nodes.Element.findall) and help(nodes.Element.next_node). The doctree object and its elements are described in The Docutils Document Tree.

    As <section> elements and <title> elements in a Docutils Document Tree do not have a level attribute (nor level-depending names like H1, H2, ... in HTML), the nodes.Element.findall() method cannot easily be used to get sections/section titles of a specific level. We have to do the traversion "by hand" iterating over child elements.

    Depending on the presence/absence of a global document title, the HTML output may use the "H2" HTML element for either top-level section headings or sub-section headings. Here are two functions to capture the title elements of top-level sections respectively sub-sections:

    from docutils import nodes
    
    def top_section_titles(doctree):
        return [child.next_node(nodes.title).astext()
                for child in doctree
                if isinstance(child, nodes.section)]
    
    def sub_section_titles(doctree):
        titles = []
        for child in doctree:
            if not isinstance(child, nodes.section):
                continue
            for grandchild in child:
                if not isinstance(grandchild, nodes.section):
                    continue
                titles.append(grandchild.next_node(nodes.title).astext())
        return titles
    

    The HTML document has auto-generated anchors for the section element. A mapping of section-title to URL can be achieved via:

    def section_urls(titles, base_url):
        return dict((title, f'{base_url}#{nodes.make_id(title)}')
                    for title in titles)
    

    See help(nodes.make_id) for the function that converts the title text to an ID.

    Lets test these functions:

    from docutils import core
    base_url = "file:demo.html"
    
    sample = """\
    Document Title
    ==============
    
    :docinfo: optional
    
    first section heading
    ---------------------
    first section text
    
    subsection 1.1
    ..............
    subsection text
    
    second section heading
    ----------------------
    second section text
    
    subsection 2.1
    ..............
    subsection text
    """
    
    def demo(fun):
        doctree = core.publish_doctree(sample, base_url)
        # print(doctree.pformat())  # show the document structure as "pseudo-XML"
        titles = fun(doctree)
        # print titles
        url_dict = section_urls(titles, base_url)
        for title, url in url_dict.items():
            print(f'{title}: {url}')
            
    print('extract all')
    demo(all_section_titles)
    
    print('\nextract top')
    demo(top_section_titles)
    
    print('\nextract sub')
    demo(sub_section_titles)