python, lxml, xml-namespaces

Python: lxml XPath tag name with colon


I have to parse a feed, but one of the elements (tags) has a colon in its name: <dc:creator>leemore23</dc:creator>

How can I parse it using lxml? So far I have done it this way:

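import requests
import lxml.etree

data = {}  # holds the extracted fields
r = requests.get('http://www.site.com/feed/')
# Workaround: rename the prefixed tag so it no longer contains a colon
foo = r.content.replace(b"dc:creator", b"dc")
tree = lxml.etree.fromstring(foo)
for article_node in tree.xpath('//item'):
    data['dc'] = article_node.xpath('.//dc')[0].text.strip()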

But I think there is a better way, something like:

data['dc'] = article_node.xpath('.//dc:creator')[0].text.strip()

or

data['dc'] = article_node.xpath('.//dc|creator')[0].text.strip()

that is, without replacing anything in the input.

What would you advise?


Solution

  • The dc: prefix indicates an XML namespace. Use the namespace support in the ElementTree API to deal with it rather than stripping it from your input. As it happens, dc usually refers to Dublin Core metadata.

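    You need to determine the full namespace URL, then use that URL in your queries. Note that lxml's xpath() method does not accept the {namespace}tag notation; that notation works with find(), findall() and iter() (or with the lxml.etree.ETXPath class):

    DCNS = 'http://purl.org/dc/elements/1.1/'
    creator = article_node.findall('.//{{{0}}}creator'.format(DCNS))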
    

    Here I used the recommended http://purl.org/dc/elements/1.1/ namespace URL for the Dublin Core prefix.

    You can normally determine the URL from the .nsmap property; your root element probably has the following .nsmap attribute:

    {'dc': 'http://purl.org/dc/elements/1.1/'}
    

    and thus you can change your code to:

    creator = article_node.findall('.//{{{0}}}creator'.format(article_node.nsmap['dc']))
    

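    This can be simplified further by passing the nsmap dictionary to the xpath() method as the namespaces keyword argument, at which point you can use the prefix directly in your XPath expression. If the document also declares a default namespace, nsmap contains a None key, which XPath cannot use as a prefix and which lxml may reject with a TypeError, so drop that entry first: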

    creator = article_node.xpath('.//dc:creator', namespaces=article_node.nsmap)