I'm writing a Python batch script that processes many Markdown files and extracts the h1 text to generate a 'title' metadata variable (I forgot to add 'title' to the frontmatter). I'm not using this as a pandoc filter.
I was thinking of processing those files with pandoc-python, but I'm not familiar with it and cannot figure out how to get only the h1.
content = pandoc.read(post.content)
'content' is in pandoc's native format, and I see something like this:
(Pdb) content
Pandoc(Meta({}), [Header(1, ('foobar', [], []), [Str('foobar:')]), Para(...
I would like to get the h1 as plain text.
I have the following snippet, which works for headers written with either # or =======.
import pandoc
from pandoc.types import Header

with open('README.md') as f:
    content = pandoc.read(f.read())
# But you can use your content.

headers = []
for elt in pandoc.iter(content):
    if isinstance(elt, Header):
        if elt[0] == 1:  # this is header 1, remove this if statement if you want all headers.
            # elt[1] is the attribute tuple; elt[1][0] is the header's identifier.
            headers.append(elt[1][0])
Or, if you want the exact header text with its original capitalization, etc.:
for elt in pandoc.iter(content):
    if isinstance(elt, Header):
        if elt[0] == 1:  # this is header 1, remove this if statement if you want all headers.
            # elt[-1] is the list of inlines that make up the header text.
            headers.append(pandoc.write(elt[-1]).strip())
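
For the batch use case in the question, the pieces above could be wrapped up roughly as follows. This is only a sketch: first_h1_text and with_title are hypothetical helper names, format="plain" and --wrap=none are my assumptions for getting unwrapped text without Markdown markup, and it assumes both the pandoc Python package and the pandoc executable are installed.

```python
def first_h1_text(markdown_text):
    """Return the text of the first level-1 header, or None if there is none."""
    # Imported here so that with_title below stays usable without pandoc.
    import pandoc
    from pandoc.types import Header

    doc = pandoc.read(markdown_text)
    for elt in pandoc.iter(doc):
        if isinstance(elt, Header) and elt[0] == 1:
            # elt[2] is the list of inlines; "plain" output drops markup
            # (assumption: --wrap=none keeps long titles on a single line).
            text = pandoc.write(elt[2], format="plain", options=["--wrap=none"])
            return text.strip()
    return None


def with_title(metadata, title):
    """Return a copy of metadata with 'title' set, unless one already exists."""
    if title is None or metadata.get("title"):
        return dict(metadata)
    return {**metadata, "title": title}
```

If post is a python-frontmatter object, as post.content in the question suggests, something like post.metadata = with_title(post.metadata, first_h1_text(post.content)) should then fill in only the missing titles.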