I have to parse a feed, but one of the elements (tags) has a colon in its name:
<dc:creator>leemore23</dc:creator>
How can I parse it using lxml? So far I have done it this way:
r = requests.get('http://www.site.com/feed/')
foo = (r.content).replace("dc:creator","dc")
tree = lxml.etree.fromstring(foo)
for article_node in tree.xpath('//item'):
    data['dc'] = article_node.xpath('.//dc')[0].text.strip()
But I think there is a better way, something like:
data['dc'] = article_node.xpath('.//dc:creator')[0].text.strip()
or
data['dc'] = article_node.xpath('.//dc|creator')[0].text.strip()
that is, without the string replacement. What would you advise?
The dc: prefix indicates an XML namespace. Use the ElementTree API's namespace support to deal with it rather than stripping it from your input. As it happens, dc usually refers to Dublin Core metadata.
You need to determine the full namespace URL, then use that URL in your queries. Note that lxml's xpath() method does not accept the {namespace}tag form; find() and findall() do:
DCNS = 'http://purl.org/dc/elements/1.1/'
creator = article_node.findall('.//{{{0}}}creator'.format(DCNS))
Here I used the recommended http://purl.org/dc/elements/1.1/ namespace URL for the Dublin Core prefix.
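As a rough, self-contained sketch of the fully qualified form, the snippet below runs findall() with a {url}tag name against a made-up sample feed. SAMPLE_FEED and its contents are assumptions for illustration only; in the question's setting the bytes would come from requests.get(...).content instead.

```python
from lxml import etree

# Hypothetical sample feed (an assumption, not the asker's real data).
SAMPLE_FEED = b"""<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>First post</title>
      <dc:creator> leemore23 </dc:creator>
    </item>
  </channel>
</rss>"""

DCNS = 'http://purl.org/dc/elements/1.1/'

tree = etree.fromstring(SAMPLE_FEED)

# find()/findall() understand the {namespace-url}tag notation directly,
# so no prefix mapping is needed here.
creator_nodes = tree.findall('.//{{{0}}}creator'.format(DCNS))
creators = [node.text.strip() for node in creator_nodes]
print(creators)
```

The doubled braces in the format string are only there to escape literal { and } around the URL.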
You can normally determine the URL from the .nsmap property; your root element probably has the following .nsmap attribute:
{'dc': 'http://purl.org/dc/elements/1.1/'}
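To check what a given document actually declares, one option is to look at .nsmap on the parsed root. The sketch below does this for a made-up feed (SAMPLE_FEED is an assumption); it also shows that child elements report the prefixes they inherit from their ancestors.

```python
from lxml import etree

# Hypothetical sample feed (an assumption, not the asker's real data).
SAMPLE_FEED = b"""<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <dc:creator>leemore23</dc:creator>
    </item>
  </channel>
</rss>"""

root = etree.fromstring(SAMPLE_FEED)

# Prefix -> URL mapping declared on (or inherited by) the root element.
root_map = root.nsmap
print(root_map)

# Descendants see the same prefixes, because nsmap includes inherited ones.
item = root.find('.//item')
item_map = item.nsmap
print(item_map['dc'])
```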
and thus you can change your code to:
creator = article_node.findall('.//{{{0}}}creator'.format(article_node.nsmap['dc']))
This can be simplified further still by passing the nsmap dictionary to the xpath() method as the namespaces keyword, at which point you can use the prefix in your XPath expression:
creator = article_node.xpath('.//dc:creator', namespaces=article_node.nsmap)
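Putting the prefix-based query into the loop from the question might look like the sketch below. The sample feed and the names ns and creators are assumptions for illustration. One precaution is added that the answer does not mention: if a document declares a default namespace, .nsmap contains a None key, which xpath() may reject, so the None entry is filtered out before the mapping is passed along.

```python
from lxml import etree

# Hypothetical sample feed (an assumption, not the asker's real data).
SAMPLE_FEED = b"""<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item><title>First</title><dc:creator>leemore23</dc:creator></item>
    <item><title>Second</title><dc:creator> someone_else </dc:creator></item>
  </channel>
</rss>"""

tree = etree.fromstring(SAMPLE_FEED)

# Drop a possible default-namespace entry (key None); XPath needs real prefixes.
ns = {prefix: url for prefix, url in tree.nsmap.items() if prefix is not None}

creators = []
for article_node in tree.xpath('//item'):
    matches = article_node.xpath('.//dc:creator', namespaces=ns)
    if matches:  # guard against items without a dc:creator element
        creators.append(matches[0].text.strip())

print(creators)
```

Checking that matches is non-empty before indexing avoids the IndexError that [0] would raise for an item with no dc:creator child.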