Tags: python, xml, namespaces, lxml, sbml

lxml: add namespace to input file


I am parsing an XML file generated by an external program and would like to add custom annotations to it, using my own namespace. My input looks like this:

<sbml xmlns="http://www.sbml.org/sbml/level2/version4" xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner" level="2" version="4">
  <model metaid="untitled" id="untitled">
    <annotation>...</annotation>
    <listOfUnitDefinitions>...</listOfUnitDefinitions>
    <listOfCompartments>...</listOfCompartments>
    <listOfSpecies>
      <species metaid="s1" id="s1" name="GenA" compartment="default" initialAmount="0">
        <annotation>
          <celldesigner:extension>...</celldesigner:extension>
        </annotation>
      </species>
      <species metaid="s2" id="s2" name="s2" compartment="default" initialAmount="0">
        <annotation>
           <celldesigner:extension>...</celldesigner:extension>
        </annotation>
      </species>
    </listOfSpecies>
    <listOfReactions>...</listOfReactions>
  </model>
</sbml>

The issue is that lxml only declares a namespace on the element where it is used, which means the declaration is repeated on every element I add, like so (simplified):

<sbml xmlns="namespace" xmlns:celldesigner="morenamespace" level="2" version="4">
  <listOfSpecies>
    <species>
      <kjw:test xmlns:kjw="http://this.is.some/custom_namespace"/>
      <celldesigner:data>Some important data which must be kept</celldesigner:data>
    </species>
    <species>
      <kjw:test xmlns:kjw="http://this.is.some/custom_namespace"/>
    </species>
    ....
  </listOfSpecies>
</sbml>

Is it possible to force lxml to write this declaration only once in a parent element, such as sbml or listOfSpecies? Or is there a good reason not to do so? The result I want would be:

<sbml xmlns="namespace" xmlns:celldesigner="morenamespace" level="2" version="4"  xmlns:kjw="http://this.is.some/custom_namespace">
  <listOfSpecies>
    <species>
      <kjw:test/>
      <celldesigner:data>Some important data which must be kept</celldesigner:data>
    </species>
    <species>
      <kjw:test/>
    </species>
    ....
  </listOfSpecies>
</sbml>

The key constraint is that the existing data read from the file must be kept, so I cannot just create a new root element (I think?).

EDIT: Code attached below.

def annotateSbml(sbml_input):
  from lxml import etree

  checkSbml(sbml_input) # Makes sure the input is valid sbml/xml.

  ns = "http://this.is.some/custom_namespace"
  etree.register_namespace('kjw', ns)

  sbml_doc = etree.ElementTree()
  root = sbml_doc.parse(sbml_input, etree.XMLParser(remove_blank_text=True))
  nsmap = root.nsmap
  nsmap['sbml'] = nsmap[None] # Makes code more readable, but seems ugly. Any alternatives to this?
  nsmap['kjw'] = ns
  ns = '{' + ns + '}'
  sbmlns = '{' + nsmap['sbml'] + '}'

  for species in root.findall('sbml:model/sbml:listOfSpecies/sbml:species', nsmap):
    species.append(etree.Element(ns + 'test'))

  sbml_doc.write("test.sbml.xml", pretty_print=True, xml_declaration=True)

  return

Solution

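  • Modifying the namespace mapping of an existing node is not directly possible in lxml: the nsmap property returns a copy, so the assignments to nsmap in your code never reach the tree. See this open ticket that has this feature as a wishlist item.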

    It originated from this thread on the lxml mailing list, where a workaround replacing the root node is given as an alternative. There are some issues with replacing the root node though: see the ticket above.
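Since that discussion, newer lxml releases (3.5 and later, as far as I can tell) accept a `top_nsmap` argument in `etree.cleanup_namespaces()`. It declares the given prefixes on the top element and drops the redundant declarations further down, without replacing the root. A minimal sketch, assuming a trimmed-down stand-in for your document (the `DOC` bytes and the variable names below are illustrative, not taken from your file):

```python
from lxml import etree

NS = "http://this.is.some/custom_namespace"
SBML_NS = "http://www.sbml.org/sbml/level2/version4"

# Illustrative stand-in for the real SBML input (assumption, heavily trimmed).
DOC = b"""<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="untitled">
    <listOfSpecies>
      <species id="s1"/>
      <species id="s2"/>
    </listOfSpecies>
  </model>
</sbml>"""

root = etree.fromstring(DOC)

# Add one annotation element per species; at this point each new element
# carries its own xmlns:kjw declaration, as described in the question.
for species in root.iter("{%s}species" % SBML_NS):
    etree.SubElement(species, "{%s}test" % NS, nsmap={"kjw": NS})
before = etree.tostring(root, encoding="unicode")

# Declare kjw once on the root and strip the now-redundant declarations.
etree.cleanup_namespaces(root, top_nsmap={"kjw": NS})
after = etree.tostring(root, encoding="unicode")

print(after)
```

Because the root element itself is untouched, its `level` and `version` attributes and everything parsed from the file stay as they were.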

    I'll put the suggested root replacement workaround code here for completeness:

    >>> DOC = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner" level="2" version="4">
    ...   <model metaid="untitled" id="untitled">
    ...     <annotation>...</annotation>
    ...     <listOfUnitDefinitions>...</listOfUnitDefinitions>
    ...     <listOfCompartments>...</listOfCompartments>
    ...     <listOfSpecies>
    ...       <species metaid="s1" id="s1" name="GenA" compartment="default" initialAmount="0">
    ...         <annotation>
    ...           <celldesigner:extension>...</celldesigner:extension>
    ...         </annotation>
    ...       </species>
    ...       <species metaid="s2" id="s2" name="s2" compartment="default" initialAmount="0">
    ...         <annotation>
    ...            <celldesigner:extension>...</celldesigner:extension>
    ...         </annotation>
    ...       </species>
    ...     </listOfSpecies>
    ...     <listOfReactions>...</listOfReactions>
    ...   </model>
    ... </sbml>"""
    >>> 
    >>> from lxml import etree
    >>> from io import StringIO
    >>> NS = "http://this.is.some/custom_namespace"
    >>> tree = etree.ElementTree(element=None, file=StringIO(DOC))
    >>> root = tree.getroot()
    >>> nsmap = root.nsmap
    >>> nsmap['kjw'] = NS
    >>> new_root = etree.Element(root.tag, nsmap=nsmap)
    >>> new_root[:] = root[:]
    >>> new_root.append(etree.Element('{%s}%s' % (NS, 'test')))
    >>> new_root.append(etree.Element('{%s}%s' % (NS, 'test')))
    
    >>> print(etree.tostring(new_root, pretty_print=True).decode())
    <sbml xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner" xmlns:kjw="http://this.is.some/custom_namespace" xmlns="http://www.sbml.org/sbml/level2/version4"><model metaid="untitled" id="untitled">
        <annotation>...</annotation>
        <listOfUnitDefinitions>...</listOfUnitDefinitions>
        <listOfCompartments>...</listOfCompartments>
        <listOfSpecies>
          <species metaid="s1" id="s1" name="GenA" compartment="default" initialAmount="0">
            <annotation>
              <celldesigner:extension>...</celldesigner:extension>
            </annotation>
          </species>
          <species metaid="s2" id="s2" name="s2" compartment="default" initialAmount="0">
            <annotation>
               <celldesigner:extension>...</celldesigner:extension>
            </annotation>
          </species>
        </listOfSpecies>
        <listOfReactions>...</listOfReactions>
      </model>
    <kjw:test/><kjw:test/></sbml>
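
Note that in the transcript above the new root has lost the `level` and `version` attributes, because `etree.Element(root.tag, nsmap=nsmap)` copies only the tag and the namespace map, and the `kjw:test` elements end up under the root rather than under each `species`. The following is a sketch of a variant that also carries over the root's attributes and leading text and adds the annotation per species. `replace_root_with_namespace` is a hypothetical helper name, and the short `DOC` is a stand-in for the real file. Comments or processing instructions outside the root element, if any, are still not carried over.

```python
from lxml import etree

NS = "http://this.is.some/custom_namespace"
SBML_NS = "http://www.sbml.org/sbml/level2/version4"


def replace_root_with_namespace(root, prefix, uri):
    """Hypothetical helper: build a new root that also declares prefix -> uri."""
    nsmap = dict(root.nsmap)  # nsmap is a copy already; dict() makes that explicit
    nsmap[prefix] = uri
    # Copy the attributes too, which the original workaround leaves out.
    new_root = etree.Element(root.tag, attrib=dict(root.attrib), nsmap=nsmap)
    new_root.text = root.text
    new_root[:] = root[:]  # moves the children over to the new root
    return new_root


# Illustrative stand-in for the real SBML input (assumption, heavily trimmed).
DOC = b"""<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="untitled">
    <listOfSpecies>
      <species id="s1"/>
      <species id="s2"/>
    </listOfSpecies>
  </model>
</sbml>"""

root = etree.fromstring(DOC)
new_root = replace_root_with_namespace(root, "kjw", NS)

# SubElement finds the kjw declaration on the new root and reuses it.
for species in new_root.iter("{%s}species" % SBML_NS):
    etree.SubElement(species, "{%s}test" % NS)

result = etree.tostring(new_root, encoding="unicode")
print(result)
```

To write the result to a file, wrap the new root first, for example `etree.ElementTree(new_root).write("test.sbml.xml", pretty_print=True, xml_declaration=True)`.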