python xml elementtree minidom processing-instruction

Finding and converting XML processing instructions using Python

We are converting our ancient FrameMaker docs to XML. My job is to convert this:

<?FM MARKER [Index] foo, bar ?>`

to this:

<indexterm>
    <primary>foo, bar</primary>
</indexterm>

I'm not worried about that part (yet); what is stumping me is that the ProcessingInstructions are all over the documents and could potentially be under any element, so I need to be able to search the entire tree, find them, and then process them. I cannot figure out how to iterate over an entire XML tree using minidom. Am I missing some secret method/iterator? This is what I've looked at thus far:

Elementtree has the excellent Element.iter() method, which is a depth-first search, but it doesn't process ProcessingInstructions.
ProcessingInstructions don't have tag names, so I cannot search for them using minidom's getElementsByTagName.
xml.sax's ContentHandler.processingInstruction looks like it's only used to create ProcessingInstructions.

Short of creating my own depth-first search algorithm, is there a way to generate a list of ProcessingInstructions in an XML file, or identify their parents?

Solution

Use the XPath API of the lxml module as such:

from lxml import etree

foo = StringIO('<foo><bar></bar></foo>')
tree = etree.parse(foo)
result = tree.xpath('//processing-instruction()')

The node test processing-instruction() is true for any processing instruction. The processing-instruction() test may have an argument that is Literal; in this case, it is true for any processing instruction that has a name equal to the value of the Literal.

References