
Extract multiple same-named fields using xmllint


I've got an XML file with a lot of media fields. A piece of example XML is:

<root>
    <item>
        <name>Item 1</name>
        <mediaList>
            <media>
                <name>Name 1</name>
                <URL><![CDATA[http://example.com/image1.jpg]]></URL>
            </media>
            <media>
                <name>Name 2</name>
                <URL><![CDATA[http://example.com/image2.jpg]]></URL>
            </media>
        </mediaList>
    </item>
    <item>
        <name>Item 2</name>
        <mediaList>
            <media>
                <name>Name 3</name>
                <URL><![CDATA[http://example.com/image3.jpg]]></URL>
            </media>
            <media>
                <name>Name 4</name>
                <URL><![CDATA[http://example.com/image4.jpg]]></URL>
            </media>
        </mediaList>
    </item>
</root>

All items are built the same way. Using xmllint with XPath, I'm trying to get a list of all URLs, but so far I haven't found a good way to do it. Here is what I've tried:

xmllint --xpath "string(/root/item/mediaList/media/URL)" file.xml >> log.txt

This one returns a clean URL, but stops after the first match (giving me only 1 image).

xmllint --xpath "/root/item/mediaList/media/URL" file.xml >> log.txt

This gives me all items, but everything is on the same line, and is shown as <URL><![CDATA[http://example.com/image.jpg]]></URL> for each item.

xmllint --xpath "/root/item/mediaList/media/URL/text()" file.xml >> log.txt

This comes closest, but it still returns the <![CDATA[]]> markers around each URL, and again everything is on one line.

I've also tried looping through the items, but that was very slow and didn't work as it should.

The result I'm aiming for is a text file with one image URL per line, like so:

http://example.com/image1.jpg
http://example.com/image2.jpg
http://example.com/image3.jpg
http://example.com/image4.jpg

Solution

  • xmllint evaluates XPath 1.0, where string(...) applied to several matching nodes returns the string value of only the first one in document order. (That is why it shows only the 1st result.)

    You can use xmlstarlet instead:

    xmlstarlet sel -T -t -m /root/item/mediaList/media/URL -v . -n file.xml
    

    which produces:

    http://example.com/image1.jpg
    http://example.com/image2.jpg
    http://example.com/image3.jpg
    http://example.com/image4.jpg
    

    Alternatively, use Perl (with the XML::LibXML module installed):

    perl -MXML::LibXML -E 'say $_->to_literal for XML::LibXML->load_xml(location=>q{file.xml})->findnodes(q{/root/item/mediaList/media/URL})'
    

    This produces the same output as above.
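
If neither xmlstarlet nor Perl is an option, the first-match behaviour of string(...) can be worked around with xmllint alone: ask for count(...) once, then request each match by position. The following is only a sketch under that assumption. The function name list_urls is made up for illustration, and the XPath assumes the document layout from the question. Because it starts xmllint once per URL, it will be noticeably slower than the one-pass commands above on large files.

```shell
#!/bin/sh
# Sketch: print every matching URL on its own line using only xmllint.
# list_urls is a hypothetical helper name; $1 is the XML file to read.
list_urls() {
    file=$1
    # Path taken from the sample document in the question.
    xp='/root/item/mediaList/media/URL'

    # count() yields a single number, so string()'s first-node rule does not apply.
    count=$(xmllint --xpath "count($xp)" "$file") || return 1

    i=1
    while [ "$i" -le "$count" ]; do
        # (path)[i] selects the i-th match in document order;
        # string() then returns its text without the CDATA markers.
        printf '%s\n' "$(xmllint --xpath "string(($xp)[$i])" "$file")"
        i=$((i + 1))
    done
}
```

It could then be called as list_urls file.xml >> log.txt to append the URLs to the log file.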