Using XPath in Python with LXML

I have a Python script that parses XML files and exports certain elements of interest to a CSV file. I am now trying to change the script so that it filters the XML file by a criterion. The equivalent XPath query would be:


When I try to use lxml to do so, my code is:

xml_file = lxml.etree.parse(xml_file_path)
namespace = "{" + xml_file.getroot().nsmap[None] + "}"
node_list = xml_file.findall(namespace + "Events/" + namespace + "Confirmation[TransactionId='*GTEREVIEW*']")

But this doesn't seem to work. Can anyone help? Example of XML file:


So I want all "Confirmation" nodes that have a TransactionId child whose text includes the string "GTEREVIEW". Thanks.


  • findall() doesn't support XPath expressions, only ElementPath. ElementPath doesn't support searching for elements containing a certain string.

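    Why don't you use XPath? Assuming that the file test.xml contains your sample XML, the following works (starts-with() only matches a prefix; use contains() instead if the string may appear anywhere in the ID):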

    > python
    Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
    [GCC 4.9.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from lxml import etree
    >>> tree=etree.parse("test.xml")
    >>> tree.xpath("Confirmation[starts-with(TransactionId, 'GTEREVIEW')]")
    [<Element Confirmation at 0x7f68b16c3c20>]

    If you insist on using findall(), the best you can do is get the list of all Confirmation elements having a TransactionId child node:

    >>> tree.findall("Confirmation[TransactionId]")
    [<Element Confirmation at 0x7f68b16c3c20>, <Element Confirmation at 0x7f68b16c3ea8>]

    You then need to filter this list manually, e.g.:

    >>> [e for e in tree.findall("Confirmation[TransactionId]")
    ...  if e.findtext("TransactionId").startswith('GTEREVIEW')]
    [<Element Confirmation at 0x7f68b16c3c20>]

    If your document contains namespaces, the following will get you all Confirmation elements having a TransactionId child node, provided that the elements use the default namespace (I used xmlns="file:xyz" as the default namespace):

    >>> tree.findall(".//{{{0}}}Confirmation[{{{0}}}TransactionId]".format(tree.getroot().nsmap[None]))
    [<Element {file:xyz}Confirmation at 0x7f534a85d1b8>, <Element {file:xyz}Confirmation at 0x7f534a85d128>]

    And there is of course etree.ETXPath:

    >>> find=etree.ETXPath("//{{{0}}}Confirmation[starts-with({{{0}}}TransactionId, 'GTEREVIEW')]".format(tree.getroot().nsmap[None]))
    >>> find(tree)
    [<Element {file:xyz}Confirmation at 0x7f534a85d1b8>]

    ETXPath lets you write namespaces in {uri}tag notation directly inside an XPath expression, so you can combine XPath and namespaces.
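
Since the question asks for IDs that include the string rather than start with it, here is a minimal sketch of the same idea with contains(). The sample document, the Root element name, and the "file:xyz" namespace are assumptions, because the original XML is not shown. It uses xpath() with a prefix map (the prefix "ns" is arbitrary) and, for comparison, findall() followed by a manual filter.

```python
from lxml import etree

# Hypothetical sample document: the question's XML is not shown, so the
# element layout and the "file:xyz" default namespace are assumptions.
SAMPLE = b"""<?xml version="1.0"?>
<Root xmlns="file:xyz">
  <Events>
    <Confirmation><TransactionId>GTEREVIEW-001</TransactionId></Confirmation>
    <Confirmation><TransactionId>ABC-GTEREVIEW-002</TransactionId></Confirmation>
    <Confirmation><TransactionId>OTHER-003</TransactionId></Confirmation>
    <Confirmation><Note>no id</Note></Confirmation>
  </Events>
</Root>"""

root = etree.fromstring(SAMPLE)

# XPath 1.0 has no default namespace, so bind the document's default
# namespace URI to an arbitrary prefix ("ns" here).
ns = {"ns": root.nsmap[None]}

# contains() matches the substring anywhere in the TransactionId text.
by_xpath = root.xpath(
    "ns:Events/ns:Confirmation[contains(ns:TransactionId, 'GTEREVIEW')]",
    namespaces=ns,
)

# Same result with findall(): pick candidates with ElementPath, then
# filter in Python with a plain substring test.
by_findall = [
    e
    for e in root.findall("ns:Events/ns:Confirmation[ns:TransactionId]", namespaces=ns)
    if "GTEREVIEW" in (e.findtext("ns:TransactionId", namespaces=ns) or "")
]

ids = [e.findtext("ns:TransactionId", namespaces=ns) for e in by_xpath]
print(ids)
```

The xpath() version keeps the whole filter in one expression; the findall() version may be easier to extend if the matching rule later needs Python logic such as a regular expression.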