Search code examples
pythonxmlelementtreeminidomprocessing-instruction

Finding and converting XML processing instructions using Python


We are converting our ancient FrameMaker docs to XML. My job is to convert this:

<?FM MARKER [Index] foo, bar ?>` 

to this:

<indexterm>
    <primary>foo, bar</primary>
</indexterm>

I'm not worried about that part (yet); what is stumping me is that the ProcessingInstructions are all over the documents and could potentially be under any element, so I need to be able to search the entire tree, find them, and then process them. I cannot figure out how to iterate over an entire XML tree using minidom. Am I missing some secret method/iterator? This is what I've looked at thus far:

  • Elementtree has the excellent Element.iter() method, which is a depth-first search, but it doesn't process ProcessingInstructions.

  • ProcessingInstructions don't have tag names, so I cannot search for them using minidom's getElementsByTagName.

  • xml.sax's ContentHandler.processingInstruction looks like it's only used to create ProcessingInstructions.

Short of creating my own depth-first search algorithm, is there a way to generate a list of ProcessingInstructions in an XML file, or identify their parents?


Solution

  • Use the XPath API of the lxml module as such:

    from lxml import etree
    
    foo = StringIO('<foo><bar></bar></foo>')
    tree = etree.parse(foo)
    result = tree.xpath('//processing-instruction()')
    

    The node test processing-instruction() is true for any processing instruction. The processing-instruction() test may have an argument that is Literal; in this case, it is true for any processing instruction that has a name equal to the value of the Literal.

    References