
Getting the h1 from Markdown via Python's pandoc library


I'm writing a Python batch script that processes many Markdown files and extracts the h1 text from each one, so that I can generate a 'title' metadata variable (I forgot to add 'title' to the front matter). I'm not using this as a pandoc filter.

I was thinking of processing those files with the pandoc Python library, but I'm not familiar with it and cannot figure out how to get only the h1.

content = pandoc.read(post.content)

'content' is in pandoc's native format, and I see something like this:

(Pdb) content
Pandoc(Meta({}), [Header(1, ('foobar', [], []), [Str('foobar:')]), Para(...

I would like to get the h1 as plain text.


Solution

  • The following snippet works for headers written with either # or =======.

    import pandoc
    from pandoc.types import Header

    with open('README.md') as f:
        content = pandoc.read(f.read())
    # Or reuse the content you have already parsed.
    headers = []

    for elt in pandoc.iter(content):
        if isinstance(elt, Header):
            # elt[0] is the header level; drop this check to collect all headers.
            if elt[0] == 1:
                # elt[1] is the attribute tuple (identifier, classes, key-values),
                # so elt[1][0] is the identifier, e.g. 'foobar'.
                headers.append(elt[1][0])
    

    Or, if you want the exact heading text, with its original capitalization and punctuation:

    for elt in pandoc.iter(content):
        if isinstance(elt, Header):
            # elt[0] is the header level; drop this check to collect all headers.
            if elt[0] == 1:
                # elt[-1] holds the inline elements of the heading; render them back to text.
                headers.append(pandoc.write(elt[-1]).strip())
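
For a batch job on a machine where pandoc is not installed, a rough fallback is to scan the Markdown text directly. The sketch below is not the pandoc approach shown above: it is a regex-based approximation using only the standard library, and `first_h1` is a hypothetical helper name. It assumes headings follow the common `# Title` or `Title` / `=====` forms, it skips lines inside fenced code blocks, it only looks at the last line of a multi-line setext heading, and it leaves inline markup (emphasis, links) in the returned text untouched.

```python
import re

# Hypothetical helper patterns; not part of the pandoc library.
ATX_H1 = re.compile(r"^ {0,3}#(?!#)[ \t]+(.*?)(?:[ \t]+#+)?[ \t]*$")
SETEXT_H1_UNDERLINE = re.compile(r"^ {0,3}=+[ \t]*$")
FENCE = re.compile(r"^ {0,3}(```|~~~)")


def first_h1(markdown_text):
    """Return the text of the first level-1 heading, or None if there is none."""
    in_fence = False
    previous = ""
    for line in markdown_text.splitlines():
        # Toggle on fence lines so that '# comment' inside code is ignored.
        if FENCE.match(line):
            in_fence = not in_fence
            previous = ""
            continue
        if in_fence:
            continue
        # ATX style: a single '#' followed by whitespace and the heading text.
        match = ATX_H1.match(line)
        if match:
            return match.group(1).strip()
        # Setext style: a non-empty line underlined with '='.
        if SETEXT_H1_UNDERLINE.match(line) and previous.strip():
            return previous.strip()
        previous = line
    return None
```

Calling `first_h1(post.content)` for each file would then give a candidate value for the 'title' variable, with `None` marking files that need manual attention. For anything beyond simple headings, parsing with pandoc as shown above is the more reliable route.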