Tags: python, filter, pandoc, panflute

Pandoc Filter via Panflute not Working as Expected


Problem

For a Markdown document I want to filter out all sections whose header titles are not in the list to_keep. A section consists of a header and the body up to the next header or the end of the document. For simplicity, let's assume that the document only has level 1 headers.

When I make a simple case distinction on whether the current element has been preceded by a header in to_keep, and accordingly either return None or return [], I get an error. That is, for pandoc --filter filter.py -o output.pdf input.md I get TypeError: panflute.dump needs input of type "panflute.Doc" but received one of type "list" (code, example file and complete error message at the end).

I use Python 3.7.4 and panflute 1.12.5 and pandoc 2.2.3.2.

Question

If I make a more fine-grained distinction on when to return [], it works (function action_working). My question is: why is this more fine-grained distinction necessary? My solution seems to work, but that might well be accidental... How can I get this to work properly?

Files

error

Traceback (most recent call last):
  File "filter.py", line 42, in <module>
    main()
  File "filter.py", line 39, in main
    return run_filter(action_not_working, doc=doc)
  File "C:\Users\ody_he\AppData\Local\Continuum\anaconda3\lib\site-packages\panflute\io.py", line 266, in run_filter
    return run_filters([action], *args, **kwargs)
  File "C:\Users\ody_he\AppData\Local\Continuum\anaconda3\lib\site-packages\panflute\io.py", line 253, in run_filters
    dump(doc, output_stream=output_stream)
  File "C:\Users\ody_he\AppData\Local\Continuum\anaconda3\lib\site-packages\panflute\io.py", line 132, in dump
    raise TypeError(msg)
TypeError: panflute.dump needs input of type "panflute.Doc" but received one of type "list"
Error running filter filter.py:
Filter returned error status 1

input.md

# English 
Some cool english text this is!

# Deutsch 
Dies ist die deutsche Übersetzung!

# Sources
Some source.

# Priority
**Medium** *[Low | Medium | High]*

# Status
**Open for Discussion** *\[Draft | Open for Discussion | Final\]*

# Interested Persons (mailing list)
- Franz, Heinz, Karl

filter.py

from panflute import *

to_keep = ['Deutsch', 'Status']
keep_current = False

def action_not_working(elem, doc):
    '''For every element we check if it occurs in a section we wish to keep. 
    If it is, we keep it and return None (indicating to keep the element unchanged).
    Otherwise we remove the element (return []).'''
    global to_keep, keep_current
    update_keep(elem)
    if keep_current:
        return None
    else:
        return []

def action_working(elem, doc):
    global to_keep, keep_current
    update_keep(elem)
    if keep_current:
        return None
    else:
        if isinstance(elem, Header):
            return []
        elif isinstance(elem, Para):
            return []
        elif isinstance(elem, BulletList):
            return []

def update_keep(elem):
    '''If the element is a header, we update keep_current.'''
    global to_keep, keep_current
    if isinstance(elem, Header):
        # Keep the section if its title is in to_keep
        keep_current = stringify(elem) in to_keep


def main(doc=None):
    return run_filter(action_not_working, doc=doc) 

if __name__ == '__main__':
    main()

Solution

  • I think what happens is that panflute calls the action on all elements, including the Doc root element. If keep_current is False when the walk reaches the Doc element, the Doc itself is replaced by a list. This leads to the error message you are seeing, as panflute expects the root node to always be there.

    The updated filter only acts on Header, Para, and BulletList elements, so the Doc root node will be left untouched. You'll probably want to use something more generic like isinstance(elem, Block) instead.


    An alternative approach could be to use panflute's load and dump functions directly: load the document into a Doc element, manually iterate over all top-level blocks in doc.content and drop the unwanted ones, then dump the resulting doc back into the output stream.

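    from panflute import Header, dump, load, stringify

    to_keep = ['Deutsch', 'Status']
    keep_current = False

    doc = load()
    kept = []
    for top_level_block in doc.content:
        # A header starts a new section and decides whether we keep it
        if isinstance(top_level_block, Header):
            keep_current = stringify(top_level_block) in to_keep
        if keep_current:
            kept.append(top_level_block)

    # Replace the document body with the blocks we decided to keep
    doc.content = kept
    dump(doc)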
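
To illustrate the first point (why the root node ends up as a list), here is a small toy model in plain Python. It does not use panflute at all: the Node, Doc, Block, Header and Para classes, the walk function and make_filter are made up for this sketch, and walk only assumes the behaviour described above, namely that the action is applied to every node including the root, and that a non-None return value replaces the node.

```python
# Toy stand-ins for a document tree. These are NOT panflute's classes.
class Node:
    def __init__(self, *children):
        self.children = list(children)

class Doc(Node):
    pass

class Block(Node):
    pass

class Header(Block):
    def __init__(self, title):
        super().__init__()
        self.title = title

class Para(Block):
    pass


def walk(node, action):
    """Apply action to all children, then to the node itself.

    None means "keep the node"; a list is spliced in as a replacement.
    This is a simplified assumption about how a filter walk behaves.
    """
    new_children = []
    for child in node.children:
        result = walk(child, action)
        if isinstance(result, list):
            new_children.extend(result)
        else:
            new_children.append(result)
    node.children = new_children
    result = action(node)
    return node if result is None else result


def make_filter(to_keep, only_blocks):
    state = {"keep": False}

    def action(node):
        if isinstance(node, Header):
            state["keep"] = node.title in to_keep
        if state["keep"]:
            return None
        if only_blocks and not isinstance(node, Block):
            return None  # leave the Doc root (and other non-blocks) alone
        return []

    return action


def sample():
    return Doc(Header("English"), Para(), Header("Deutsch"), Para(),
               Header("Sources"), Para())


unguarded = walk(sample(), make_filter(["Deutsch"], only_blocks=False))
guarded = walk(sample(), make_filter(["Deutsch"], only_blocks=True))

print(type(unguarded).__name__)  # list
print(type(guarded).__name__)    # Doc
print([getattr(c, "title", "Para") for c in guarded.children])  # ['Deutsch', 'Para']
```

In this model the unguarded action only fails because the last section ("Sources") is not in to_keep, so the state is still "drop" when the root is visited. If the last section happened to be a kept one, the same filter would appear to work, which is probably why the behaviour looks accidental.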