
Select XML node by XPath with attribute value containing an apostrophe


I'm trying to extract some data from a given XML file. To do that, I need to select specific nodes by their attribute values. My XML looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<svg ....>
    ....
    <g font-family="'BentonSans Medium'" font-size="12">
        <text>bla bla bla</text>
        ....
    </g>
    ....
</svg>

I've tried to escape the apostrophes in the value, but I couldn't get it working.

from lxml import etree as ET

tree = ET.parse("file.svg")
root = tree.getroot()

xPath = ".//g[@font-family='&apos;BentonSans Medium&apos;]"
print(root.findall(xPath))

I always get errors of this kind:

File "C:\Python34\lib\site-packages\lxml\_elementpath.py", line 214, in prepare_predicate
raise SyntaxError("invalid predicate")

Does anyone have an idea how to select these nodes with XPath?


Solution

  • Try this:

    xPath = ".//g[@font-family=\"'BentonSans Medium'\"]"
    

    Your code raises the error because the string literal in the predicate is missing its closing single quote:

    xPath = ".//g[@font-family='&apos;BentonSans Medium&apos;]"
    

    The closing quote belongs after the last &apos;:

    xPath = ".//g[@font-family='&apos;BentonSans Medium&apos;']"
    

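    But that alone doesn't make the XPath expression match. &apos; is an XML entity, not an XPath escape, so inside a Python string it is compared literally, character by character, and never equals an actual apostrophe.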


    By the way, if you only want to check whether font-family contains the given string, use the XPath contains() function with the xpath method:

    xPath = '//g[contains(@font-family, "BentonSans Medium")]'
    print(root.xpath(xPath))
    

    Output

    [<Element g at 0x7f2093612108>]
    

    The sample code fetches all g elements whose font-family attribute value contains the string BentonSans Medium.

    The findall method doesn't work with contains() because it only implements ElementPath, a limited subset of XPath that has no function calls. The xpath method supports full XPath 1.0, so I would recommend using it instead.
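
The options discussed in this answer can be compared side by side in a small sketch. It assumes a namespace-free document like the excerpt in the question; the inline SVG string and the variable name `family` are made up for illustration. It also shows one technique not mentioned above, lxml's XPath variables, which avoid quoting inside the expression altogether. A real SVG file usually declares the SVG namespace, in which case the `g` steps would need a namespace prefix or mapping.

```python
from lxml import etree as ET

# Hypothetical, namespace-free stand-in for the SVG in the question.
SVG = b"""<svg>
    <g font-family="'BentonSans Medium'" font-size="12">
        <text>bla bla bla</text>
    </g>
    <g font-family="Arial" font-size="10">
        <text>other</text>
    </g>
</svg>"""

root = ET.fromstring(SVG)

# 1. Wrap the value in double quotes so the apostrophes are part of the literal.
quoted = root.findall(".//g[@font-family=\"'BentonSans Medium'\"]")

# 2. XPath variable: lxml binds $family from the keyword argument,
#    so the value needs no quoting inside the expression.
by_variable = root.xpath(".//g[@font-family=$family]",
                         family="'BentonSans Medium'")

# 3. Substring match with contains(), which requires xpath(), not findall().
by_contains = root.xpath('.//g[contains(@font-family, "BentonSans Medium")]')

# 4. The entity form parses once the quote is closed, but it is compared
#    literally, so it matches nothing.
by_entity = root.findall(".//g[@font-family='&apos;BentonSans Medium&apos;']")

print(len(quoted), len(by_variable), len(by_contains), len(by_entity))
# prints: 1 1 1 0
```

The variable form is probably the most robust choice when the value comes from user input or may contain both quote characters, since XPath 1.0 string literals have no escape mechanism.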