
Parsing Google Earth KML file in Python (lxml, namespaces)


I am trying to parse a .kml file in Python using lxml (after failing to make this work in BeautifulSoup, which I use for HTML).

As this is my first time doing this, I followed the official tutorial and all goes well until I try to construct an iterator to extract my data by root iteration:

from lxml import etree
tree = etree.parse('kmlfile')

Here is the example from the tutorial I am trying to emulate:

If you know you are only interested in a single tag, you can pass its name to getiterator() to have it filter for you:

for element in root.getiterator("child"):
    print(element.tag, '-', element.text)

I would like to get all data under 'Placemark', so I tried

for i in tree.getiterator("Placemark"):
    print(i, type(i))

which doesn't give me anything. What does work is:

for i in tree.getiterator("{http://www.opengis.net/kml/2.2}Placemark"):
    print(i, type(i))

I don't understand how this comes about. The URL http://www.opengis.net/kml/2.2 is listed in the root tag at the beginning of the document (<kml xmlns="http://www.opengis.net/kml/2.2" ...>), but I don't understand

  • how the part in {} relates to my specific example at all

  • why it is different from the tutorial

  • and what I am doing wrong

Any help is much appreciated!


Solution

  • Here is my solution. The most important thing to do is to read the description of namespaces that Tomalak posted. It is a really good description of namespaces and easy to understand.

    We are going to use XPath to navigate the XML document. Its notation is similar to file system paths, where parents and descendants are separated by slashes (/). Note that some commands are different in the lxml implementation.

    ### Problem

    Our goal is to extract the city name: the content of <name> which is under <Placemark>. Here's the relevant XML:

    <Placemark> <name>CITY NAME</name> 
    

    The XPath equivalent to the non-functional code I posted above is:

    tree = etree.parse('kml document')
    result = tree.xpath('//Placemark/name/text()')
    

    The text() part is needed to get the text contained in the nodes at //Placemark/name.

    Now this doesn't work, as Tomalak pointed out, because the names of these two nodes are actually {http://www.opengis.net/kml/2.2}Placemark and {http://www.opengis.net/kml/2.2}name. The part in curly brackets is the URI of the default namespace. It is not repeated on each element in the document (which confused me), but it is declared once at the beginning of the XML document like this:

    xmlns="http://www.opengis.net/kml/2.2"
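To see where the curly-bracket part comes from, here is a minimal sketch. The KML string is a made-up stand-in for a real file, not the document from the question. It prints the tag lxml reports for an unprefixed element and compares a search by bare name with a search by qualified name.

```python
from lxml import etree

# Hypothetical minimal KML document, made up for illustration.
KML = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>CITY NAME</name></Placemark>
  </Document>
</kml>"""

root = etree.fromstring(KML)
placemark = root[0][0]  # kml -> Document -> Placemark

# Unprefixed elements inherit the default namespace, so lxml reports
# the tag in {namespace}localname form.
print(placemark.tag)

# QName splits the qualified tag into its two parts.
qname = etree.QName(placemark)
print(qname.namespace)
print(qname.localname)

# The bare local name matches nothing; the qualified name matches.
bare_matches = list(root.iter("Placemark"))
qualified_matches = list(root.iter("{http://www.opengis.net/kml/2.2}Placemark"))
print(len(bare_matches), len(qualified_matches))
```

So the element is never really called just Placemark; the namespace URI is always part of its name, even though the file only spells it out once.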
    

    ### Solution

    We can supply namespaces to xpath by setting the namespaces argument:

    xpath(X, namespaces={prefix: namespace})
    

    This is easy enough for the namespaces that have actual prefixes, in this document for instance <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode> where the gx prefix is defined in the document as xmlns:gx="http://www.google.com/kml/ext/2.2".
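As a rough sketch of the prefixed case, the snippet below looks up the gx element on a made-up fragment (the structure is simplified and hypothetical, not a complete KML file). It also shows that the prefix in the dictionary is only a local alias: the name `ext` is my own invention and works as long as it maps to the same URI.

```python
from lxml import etree

# Hypothetical fragment, made up for illustration.
KML = b"""<kml xmlns="http://www.opengis.net/kml/2.2"
     xmlns:gx="http://www.google.com/kml/ext/2.2">
  <Placemark>
    <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode>
  </Placemark>
</kml>"""

GX_NS = "http://www.google.com/kml/ext/2.2"
root = etree.fromstring(KML)

# Using the same prefix as the document.
modes = root.xpath("//gx:altitudeMode/text()", namespaces={"gx": GX_NS})

# Using a different, invented prefix bound to the same URI.
same_modes = root.xpath("//ext:altitudeMode/text()", namespaces={"ext": GX_NS})

print(modes, same_modes)
```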

    However, XPath has no notion of a default namespace (cf. the docs). Therefore we need to trick it, as Tomalak suggested: we invent a prefix for the default namespace and add it to our search terms. We can call it kml, for instance. This piece of code does the trick:

    tree.xpath('//kml:Placemark/kml:name/text()', namespaces={"kml":"http://www.opengis.net/kml/2.2"})
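For anyone who wants to try this without a KML file at hand, here is a self-contained sketch. The inline document and the city names are made up for illustration; only the namespace URI and the XPath expressions come from the text above.

```python
from lxml import etree

# Hypothetical KML content, made up for illustration.
KML = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>CITY A</name></Placemark>
    <Placemark><name>CITY B</name></Placemark>
  </Document>
</kml>"""

# "kml" is an invented prefix that stands in for the default namespace.
NS = {"kml": "http://www.opengis.net/kml/2.2"}

root = etree.fromstring(KML)

# Without the prefix, nothing matches.
unqualified = root.xpath("//Placemark/name/text()")

# With the invented prefix, the names are found.
names = root.xpath("//kml:Placemark/kml:name/text()", namespaces=NS)
print(unqualified, names)

# The same mapping also works for per-element lookups.
for placemark in root.xpath("//kml:Placemark", namespaces=NS):
    print(placemark.findtext("kml:name", namespaces=NS))
```

Keeping the mapping in one dictionary such as `NS` means the URI is written only once, however many expressions use it.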
    

    The tutorial mentions that there is also ETXPath, which works just like XPath except that the namespaces are written out in curly brackets inside the expression instead of being defined in a dictionary. Thus, the input would be of the style {http://www.opengis.net/kml/2.2}Placemark.
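
A minimal sketch of the ETXPath variant, again on a made-up inline document. The variable names `KML_NS` and `Q` are my own and only keep the expression readable.

```python
from lxml import etree

# Hypothetical KML content, made up for illustration.
KML = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>CITY NAME</name></Placemark>
  </Document>
</kml>"""

KML_NS = "http://www.opengis.net/kml/2.2"
Q = "{" + KML_NS + "}"  # qualified-name prefix in curly-bracket form

root = etree.fromstring(KML)

# ETXPath compiles the expression once; no namespaces dictionary is needed
# because the URI is written directly into each step.
find_names = etree.ETXPath("//" + Q + "Placemark/" + Q + "name/text()")
names = find_names(root)
print(names)
```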