
Parse XML with namespace attribute changing in Python


I am making a request to a URL, and in the XML response I get, the namespace in the xmlns attribute changes from time to time. As a result, finding an element returns None when I hardcode the namespace.

For instance, I get the following XML:

<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<version>1.1.0</version>
<authors>ICSharpCode</authors>
<owners>ICSharpCode</owners>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
<projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see https://github.com/icsharpcode/SharpZipLib/wiki/Release-1.1 for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="https://github.com/icsharpcode/SharpZipLib" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<dependencies>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>
</dependencies>
</metadata>
</package>

Note the xmlns attribute. The URI is otherwise always the same, but the '2012/06' part differs between responses. I have the following Python script; see the line ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}. I can't hardcode the namespace like that. Are there any alternatives, such as regular expressions, for mapping the namespace? Only the date part changes: it is 2013/05 in some responses, 2012/04 in others, and so on.

import requests
import xml.etree.ElementTree as ET

def fetch_nuget_spec(self, versioned_package):
    name = versioned_package.package.name.lower()
    version = versioned_package.version.lower()
    url = f'https://api.nuget.org/v3-flatcontainer/{name}/{version}/{name}.nuspec'
    response = requests.get(url)
    metadata = ET.fromstring(response.content)
    # Hardcoded namespace: lookups return None when the date part differs.
    ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}
    license = metadata.find('./nuspec:metadata/nuspec:license', ns)
    if license is None:
        # Fall back to the older <licenseUrl> element.
        license_url = metadata.find('./nuspec:metadata/nuspec:licenseUrl', ns)
        if license_url is None:
            return {'license': 'Not Found'}
        return {'license': license_url.text}
    if not license.text:
        # .text is None for an empty element, so avoid calling len() on it.
        print('Empty license element')
    return {'license': license.text}

  

Solution

  • Without any additional module, using only xml.etree.ElementTree:

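    import xml.etree.ElementTree as ET
    
    tree = ET.parse('xml_str.xml')
    root = tree.getroot()
    
    # Collect the namespace declarations as a {prefix: uri} dict;
    # the default namespace is stored under the empty prefix ''.
    ns = dict(node for _, node in ET.iterparse('xml_str.xml', events=['start-ns']))
    print(ns)
    
    # The '' key is applied to unprefixed names in the path (Python 3.8 or later).
    license_url = root.find(".//licenseUrl", ns).text
    print("LicenseUrl: ", license_url)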
    

    Output:

    {'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
    LicenseUrl:  https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
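
The example above reads from a file, while the question has the document in memory as response.content. A minimal sketch of the same idea for in-memory bytes follows; the helper name parse_with_namespaces and the trimmed SAMPLE document are made up for illustration, and the default namespace is rebound to an explicit nuspec prefix so the lookup does not rely on how an empty-prefix key is handled.

```python
import io
import xml.etree.ElementTree as ET


def parse_with_namespaces(xml_bytes):
    """Return (root, nsmap) for an XML document held in memory (hypothetical helper)."""
    # First pass: collect (prefix, uri) pairs from the 'start-ns' events.
    nsmap = dict(
        node for _, node in ET.iterparse(io.BytesIO(xml_bytes), events=['start-ns'])
    )
    # Second pass: build the element tree itself.
    root = ET.fromstring(xml_bytes)
    return root, nsmap


# Trimmed-down sample in the shape of the XML from the question.
SAMPLE = b"""<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata>
    <id>SharpZipLib</id>
    <licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
  </metadata>
</package>"""

root, nsmap = parse_with_namespaces(SAMPLE)
# Rebind the default namespace (stored under '') to an explicit prefix.
ns = {'nuspec': nsmap['']}
license_url = root.find('./nuspec:metadata/nuspec:licenseUrl', ns)
print(license_url.text)
```

In the function from the question, response.content would take the place of SAMPLE, and the existing nuspec: paths could stay unchanged.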
    

    Option 2, if parsing time is important:

    
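    import xml.etree.ElementTree as ET
    
    nsmap = {}
    # Single pass: record namespaces as they are declared and check
    # each element as soon as it has been fully parsed.
    for event, node in ET.iterparse('xml_str.xml', events=['start-ns', 'end']):
        if event == 'start-ns':
            prefix, uri = node
            nsmap[prefix] = uri
            print(nsmap)
        # Compare against the default namespace rather than the last one seen.
        elif node.tag == f"{{{nsmap.get('', '')}}}licenseUrl":
            print(node.text)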
    

    Output:

    
    {'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
    https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
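
Two further variants may be worth considering when the whole document is already parsed; both are sketches under stated assumptions rather than part of the answer above. Variant A reads the namespace URI from the root element's tag, which ElementTree stores as '{uri}local'. Variant B uses the '{*}' wildcard in the path, which assumes Python 3.8 or later. The helper name namespace_of and the SAMPLE document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample with a different date part than the XML in the question.
SAMPLE = b"""<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata>
    <id>SharpZipLib</id>
    <licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
  </metadata>
</package>"""


def namespace_of(element):
    """Return the namespace URI from a tag of the form '{uri}local' (hypothetical helper)."""
    if element.tag.startswith('{'):
        return element.tag[1:].split('}', 1)[0]
    return ''


root = ET.fromstring(SAMPLE)

# Variant A: build the prefix map from whatever namespace the root carries.
ns = {'nuspec': namespace_of(root)}
via_root_tag = root.find('./nuspec:metadata/nuspec:licenseUrl', ns)

# Variant B (Python 3.8+): '{*}' matches the tag name in any namespace.
via_wildcard = root.find('./{*}metadata/{*}licenseUrl')

print(via_root_tag.text)
print(via_wildcard.text)
```

Variant A assumes every element of interest lives in the root's namespace, which holds for the sample in the question. Variant B ignores the namespace entirely, so it would also match same-named elements from unrelated namespaces.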