
How to scrape XML with Python?


I am trying to parse the following XML with Python. I am using:

thumbnail_tag = dom.getElementsByTagName('media:thumbnail')[0].toxml()

This selects the first one. I know I can change the [0] to [1] to get the tag with yt:name="mqdefault", but is there another way to change the parameter in the statement above (add something to media:thumbnail)?

<entry>
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" height="90" width="120" time="00:01:48.500" yt:name="default" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" height="180" width="320" yt:name="mqdefault" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
</entry>

<entry>
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" height="90" width="120" time="00:01:48.500" yt:name="default" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" height="180" width="320" yt:name="mqdefault" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
</entry>

Solution

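  • To build a DOM object from this XML string, the namespace prefixes it uses (media: and yt:) must be declared, either in the root element or in the element where they are used. Otherwise the namespace-aware parser rejects the document with an "unbound prefix" error.

    A namespace is declared with an xmlns attribute in the start tag of an element.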

    The namespace declaration has the following syntax:

    xmlns:prefix="URI"
    

    For example:

    <root>
        <h:table xmlns:h="http://bluejson.com/W3C/">
            <h:tr>
                <h:td>JSON</h:td>
                <h:td>JavaScript</h:td>
                <h:td>Python</h:td>
            </h:tr>
        </h:table>
    
        <f:table xmlns:f="http://bluejson.com/W3C/">
            <f:name>My Study Room</f:name>
            <f:width>800</f:width>
            <f:height>420</f:height>
            <f:length>1120</f:length>
        </f:table>
    </root>
    

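    In the example above, the xmlns attributes on the two table elements bind the h: and f: prefixes to a namespace. Both prefixes are bound to the same URI here, so they name the same namespace; distinct vocabularies would normally use distinct URIs.

    When a namespace is declared on an element, all child elements with the same prefix are associated with that namespace.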

    Namespaces can be declared in the elements where they are used or in the XML root element:

    <root xmlns:h="http://bluejson.com/W3C/" xmlns:f="http://bluejson.com/W3C/">
        <h:table>
            <h:tr>
                <h:td>JSON</h:td>
                <h:td>JavaScript</h:td>
                <h:td>Python</h:td>
            </h:tr>
        </h:table>
    
        <f:table>
            <f:name>My Study Room</f:name>
            <f:width>800</f:width>
            <f:height>420</f:height>
            <f:length>1120</f:length>
        </f:table>
    </root>
    

    Now, the Python code to create your XML DOM object and read and set attributes:

    import xml.dom.minidom
    
    # The namespace URIs here are made up for the example; the lookups below
    # must use the same URIs as the xmlns declarations.
    dom = xml.dom.minidom.parseString("""
    <root xmlns:media="http://media/" xmlns:yt="http://media/yt/">
        <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" height="90" width="120" time="00:01:48.500" yt:name="default" />
        <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" height="180" width="320" yt:name="mqdefault" />
        <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
    </root>""")
    
    # Look up elements by namespace URI and local name rather than by prefix.
    media_thumbnail = dom.getElementsByTagNameNS("http://media/", "thumbnail")
    print(media_thumbnail[0].getAttribute("height"))
    print(media_thumbnail[0].getAttribute("width"))
    print(media_thumbnail[0].getAttribute("time"))
    print(media_thumbnail[0].getAttributeNS("http://media/yt/", "name"))
    # Add a plain attribute and a namespaced one. Passing "yt:value" as the
    # qualified name would serialize the latter with its yt: prefix.
    media_thumbnail[0].setAttribute("unit", "px")
    media_thumbnail[0].setAttributeNS("http://media/yt/", "value", "1")
    print(dom.toxml())
    

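    Output (on Python 3.8 and later, toxml() keeps attributes in document order instead of sorting them alphabetically as shown here):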

    90
    120
    00:01:48.500
    default
    <?xml version="1.0" ?><root xmlns:media="http://media/" xmlns:yt="http://media/yt/">
        <media:thumbnail height="90" time="00:01:48.500" unit="px" url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" value="1" width="120" yt:name="default"/>
        <media:thumbnail height="180" url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" width="320" yt:name="mqdefault"/>
        <media:thumbnail height="360" url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" width="480" yt:name="hqdefault"/>
    </root>
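
To pick a thumbnail by its yt:name value instead of by position, one option is to loop over the matching elements and compare the namespaced attribute. This is only a minimal sketch: it assumes the same placeholder namespace URIs used above (http://media/ and http://media/yt/), which would have to be replaced by the URIs the real feed declares, and find_thumbnail is a hypothetical helper name, not part of minidom.

```python
import xml.dom.minidom

# Placeholder namespace URIs (assumption): use the URIs declared in the real feed.
MEDIA_NS = "http://media/"
YT_NS = "http://media/yt/"

XML = """<root xmlns:media="http://media/" xmlns:yt="http://media/yt/">
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" height="90" width="120" time="00:01:48.500" yt:name="default" />
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" height="180" width="320" yt:name="mqdefault" />
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
</root>"""


def find_thumbnail(node, name):
    """Return the first media:thumbnail under node whose yt:name equals name, or None."""
    for thumb in node.getElementsByTagNameNS(MEDIA_NS, "thumbnail"):
        # getAttributeNS returns an empty string when the attribute is absent.
        if thumb.getAttributeNS(YT_NS, "name") == name:
            return thumb
    return None


dom = xml.dom.minidom.parseString(XML)
mq = find_thumbnail(dom, "mqdefault")
print(mq.getAttribute("url"))
print(mq.getAttribute("width"), mq.getAttribute("height"))
```

Because the helper accepts any node, it can also be called on a single entry element to search only that entry's thumbnails.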