This question attempts to extend the code from Jason S (thanks, Jason), found here: Docutils: traverse sections?.
Original from Jason S
import docutils.nodes

def doctree_resolved(app, doctree, docname):
    # Print the title of every section in the resolved doctree.
    for section in doctree.traverse(docutils.nodes.section):
        title = section.next_node(docutils.nodes.Titular)
        if title:
            print(title.astext())

def setup(app):
    app.connect('doctree-resolved', doctree_resolved)
Now, suppose I want to capture the text of only H2 subsections (or at least all subsections if that's the only option).
In theory, I'm trying to create a dictionary of subsection titles, with their respective urls/paths. What is the best way to do so?
My revision of the original above is unsuccessful, but hopefully you can understand my approach by reading the code. A simple for-loop that builds lists does not work, I believe, because of the line title = section.next_node(docutils.nodes.Titular), specifically next_node. I looked but could not find documentation on next_node.
So, what might be a different way to achieve capturing the H2 subsections from each document so I can build a dictionary where each subsection has a path/url? NOTE: I have not yet attempted to construct the full URL in the code below.
import docutils.nodes

docname_list = []
section_list = []

def doctree_resolved(app, doctree, docname):
    for section in doctree.traverse(docutils.nodes.section):
        title = section.next_node(docutils.nodes.Titular)
        if title:
            print(title.astext())
            docname_list.append(docname)
            section_list.append(title.astext())
    # NOTE: the keys are docnames, so for a document with several
    # sections only the last title survives in the dictionary.
    url_dict = dict(zip(docname_list, section_list))
...
Here's the "hacky" method I used at first (also in a comment below), which I knew wasn't really the best way to derive the "top title" of each document.
def run(self):
    node_sec = nodes.section()
    docs = node_sec.document
    title_list = []
    for t in docs.traverse(docutils.nodes.title):
        if t.tagname == "title" and t is not None:
            title_list.append(t.astext())
            break
The easiest way to get a list of all section titles in a document is:
from docutils import nodes

def all_section_titles(doctree):
    return [section.next_node(nodes.title).astext()
            for section in doctree.findall(nodes.section)]
Looking directly for nodes.title elements would also include the global document title as well as titles of topics, tables, and generic admonitions.
For details on the element methods used, see their Python docstrings: help(nodes.Element.findall) and help(nodes.Element.next_node).
The doctree object and its elements are described in The Docutils Document Tree.
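The question uses traverse() while the function above uses findall(). As far as I know, findall() was added in Docutils 0.18.1 and traverse() is deprecated in its favour. If the code has to run on older releases as well, a small wrapper along these lines may help. This is only a sketch; iter_nodes is a hypothetical helper name, not part of Docutils.

```python
from docutils import nodes


def iter_nodes(node, condition=None):
    """Iterate over `node` and its descendants that match `condition`.

    Hypothetical helper: prefers findall() and falls back to traverse()
    on Docutils versions that do not provide findall().
    """
    if hasattr(node, 'findall'):
        return iter(node.findall(condition))
    return iter(node.traverse(condition))


# Build a tiny tree by hand to try the helper.
section = nodes.section()
section += nodes.title(text='A heading')
section += nodes.paragraph(text='Some text.')

titles = [n.astext() for n in iter_nodes(section, nodes.title)]
print(titles)
```
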
As <section> elements and <title> elements in a Docutils Document Tree do not have a level attribute (nor level-dependent names like H1, H2, ... in HTML), the nodes.Element.findall() method cannot easily be used to get sections or section titles of a specific level.
We have to do the traversal "by hand", iterating over child elements.
Depending on the presence or absence of a global document title, the HTML output may use the "H2" HTML element for either top-level section headings or sub-section headings. Here are two functions that capture the titles of top-level sections and of sub-sections, respectively:
from docutils import nodes

def top_section_titles(doctree):
    return [child.next_node(nodes.title).astext()
            for child in doctree
            if isinstance(child, nodes.section)]

def sub_section_titles(doctree):
    titles = []
    for child in doctree:
        if not isinstance(child, nodes.section):
            continue
        for grandchild in child:
            if not isinstance(grandchild, nodes.section):
                continue
            titles.append(grandchild.next_node(nodes.title).astext())
    return titles
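The two functions above follow the same pattern, so they could be folded into one recursive helper that takes the nesting depth as a parameter. The following is a minimal sketch under that assumption; section_titles_at_depth and the sample text are made up for illustration.

```python
from docutils import core, nodes


def section_titles_at_depth(node, depth):
    """Return the titles of sections nested `depth` levels below `node`.

    depth=1 selects the top-level sections, depth=2 their sub-sections,
    and so on. Hypothetical helper, not part of Docutils.
    """
    sections = [child for child in node if isinstance(child, nodes.section)]
    if depth == 1:
        return [s.next_node(nodes.title).astext() for s in sections]
    titles = []
    for s in sections:
        titles.extend(section_titles_at_depth(s, depth - 1))
    return titles


source = """\
Document Title
==============

first section
-------------

text

subsection 1.1
..............

text

second section
--------------

text
"""

doctree = core.publish_doctree(source)
top_titles = section_titles_at_depth(doctree, 1)
sub_titles = section_titles_at_depth(doctree, 2)
print(top_titles)
print(sub_titles)
```

With depth=1 this should behave like top_section_titles() and with depth=2 like sub_section_titles().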
The HTML document has auto-generated anchors for the section element. A mapping of section-title to URL can be achieved via:
def section_urls(titles, base_url):
    return {title: f'{base_url}#{nodes.make_id(title)}'
            for title in titles}
See help(nodes.make_id) for the function that converts the title text to an ID.
Let's test these functions:
from docutils import core

base_url = "file:demo.html"

sample = """\
Document Title
==============

:docinfo: optional

first section heading
---------------------

first section text

subsection 1.1
..............

subsection text

second section heading
----------------------

second section text

subsection 2.1
..............

subsection text
"""

def demo(fun):
    doctree = core.publish_doctree(sample, base_url)
    # print(doctree.pformat())  # show the document structure as "pseudo-XML"
    titles = fun(doctree)
    # print(titles)
    url_dict = section_urls(titles, base_url)
    for title, url in url_dict.items():
        print(f'{title}: {url}')

print('extract all')
demo(all_section_titles)
print('\nextract top')
demo(top_section_titles)
print('\nextract sub')
demo(sub_section_titles)