How to parse the <rss> tag with XML::LibXML to find xmlns definitions


Podcasts don't seem to define their RSS feeds in any consistent way. I ran into one that declares a different set of schema (xmlns) definitions on its <rss> element.

What's the best way to scan for XML namespace declarations in an RSS feed using XML::LibXML?

E.g.

One feed might be

<rss 
    xmlns:content="http://purl.org/rss/1.0/modules/content/" 
    xmlns:wfw="http://wellformedweb.org/CommentAPI/" 
    xmlns:dc="http://purl.org/dc/elements/1.1/" 
    xmlns:atom="http://www.w3.org/2005/Atom" 
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" 
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">

Another might be

<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom">

I want my script to assess all the namespaces a feed uses, so that the appropriate field names can be tracked when parsing the RSS.

I'm not sure what that will look like yet, as I don't know whether this module can break the <rss> tag's attributes apart the way I want.


Solution

  • I'm not sure I understand exactly what kind of output you're looking for, but XML::LibXML is indeed able to list the namespaces:

    use warnings;
    use strict;
    use XML::LibXML;
    
    my $dom = XML::LibXML->load_xml(string => <<'EOT');
    <rss 
        xmlns:content="http://purl.org/rss/1.0/modules/content/" 
        xmlns:wfw="http://wellformedweb.org/CommentAPI/" 
        xmlns:dc="http://purl.org/dc/elements/1.1/" 
        xmlns:atom="http://www.w3.org/2005/Atom" 
        xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" 
        xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">
    </rss>
    EOT
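    # getNamespaces returns the namespace declarations on the root element;
    # getLocalName gives the declared prefix, getData the namespace URI.
    for my $ns ($dom->documentElement->getNamespaces) {
        print $ns->getLocalName(), " / ", $ns->getData(), "\n";
    }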
    

    Output:

    content / http://purl.org/rss/1.0/modules/content/
    wfw / http://wellformedweb.org/CommentAPI/
    dc / http://purl.org/dc/elements/1.1/
    atom / http://www.w3.org/2005/Atom
    sy / http://purl.org/rss/1.0/modules/syndication/
    slash / http://purl.org/rss/1.0/modules/slash/
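
To track field names across feeds, one possible follow-up is to key on the namespace URI rather than the prefix, since the prefix is whatever each feed author chose. The sketch below is only an illustration under stated assumptions: the sample feed, the %prefix_for hash, and the "it" prefix are made up for the example, and a real script would load the feed from its URL or a file. It collects the declarations into a URI-to-prefix map, then registers its own prefix with XML::LibXML::XPathContext so the query works whatever prefix the feed uses.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical sample feed, modelled on the second <rss> tag in the question.
my $xml = <<'EOT';
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example</title>
    <itunes:author>Someone</itunes:author>
  </channel>
</rss>
EOT

my $dom  = XML::LibXML->load_xml(string => $xml);
my $root = $dom->documentElement;

# Map each namespace URI declared on <rss> to the prefix the feed chose.
my %prefix_for;
for my $ns ($root->getNamespaces) {
    my $prefix = $ns->getLocalName;
    next unless defined $prefix;    # skip a default (unprefixed) namespace
    $prefix_for{ $ns->getData } = $prefix;
}

# Query by URI: bind our own prefix ("it") to the iTunes namespace, so the
# XPath expression does not depend on the prefix used inside the feed.
my $itunes_uri = 'http://www.itunes.com/dtds/podcast-1.0.dtd';
my $author;
if (exists $prefix_for{$itunes_uri}) {
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs(it => $itunes_uri);
    $author = $xpc->findvalue('/rss/channel/it:author');
}

print "itunes author: $author\n" if defined $author;
```

With the sample feed above, %prefix_for holds the itunes and atom URIs, and the script prints "itunes author: Someone". A feed that does not declare the iTunes namespace skips the lookup.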