
How do I declare new oxml/xmlchemy tags in python-docx?


I am trying to build basic equation functionality into python-docx so that I can write formulas to docx files. Can someone go over the standard procedure for registering a new element class in oxml? Looking at the source code, a tag appears to be declared by creating a complex-type class:

class CT_P(BaseOxmlElement):
    """
    ``<w:p>`` element, containing the properties and text for a paragraph.
    """
    pPr = ZeroOrOne('w:pPr')
    r = ZeroOrMore('w:r')

and then registering it using the register_element_cls() function:

from .text.paragraph import CT_P
register_element_cls('w:p', CT_P)

Some classes include other methods, but many do not, so it looks like a minimum working example would be this:

from docx import Document
from docx.oxml.xmlchemy import BaseOxmlElement, ZeroOrOne, ZeroOrMore, OxmlElement
import docx.oxml
docx.oxml.ns.nsmap['m'] = ('http://schemas.openxmlformats.org/officeDocument/2006/math')

class CT_OMathPara(BaseOxmlElement):
    r = ZeroOrMore('w:r')

docx.oxml.register_element_cls('m:oMathPara',CT_OMathPara)  
p = CT_OMathPara()

(Note that I have to declare the m namespace myself, since the package does not use it.) Unfortunately, this doesn't work at all. If I declare a new class as in the example above, instantiate it, and then evaluate the instance at the prompt (which calls its __repr__), it raises an exception:

>> p

File "C:\ProgramData\Anaconda3\lib\site-packages\docx\oxml\ns.py", line 50, in from_clark_name
    nsuri, local_name = clark_name[1:].split('}')

ValueError: not enough values to unpack (expected 2, got 1)

This happens because the tag on my instance is very different from the tag on a w:p element created by python-docx:

>> paragraph._element.tag
 '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p'

>> p.tag
 'CT_OMathPara'

But I don't know why this is. A file search through the source code does not reveal any other mentions of the CT_P class, so I'm a bit stumped.


Solution

  • I think the error comes from the 'm' namespace prefix (nspfx) not being present in the docx.oxml.ns.pfxmap dict. A namespace needs to be resolvable in both directions: from prefix to namespace URI (nsmap) and from URI back to prefix (pfxmap).

    So to add the new namespace from "outside", meaning after the ns module is loaded, you'll need to update both dicts (if you were to patch the ns module code directly, that second step would be handled automatically at load time):

    nsmap, pfxmap = docx.oxml.ns.nsmap, docx.oxml.ns.pfxmap
    nsmap['m'] = 'http://schemas.openxmlformats.org/officeDocument/2006/math'
    pfxmap['http://schemas.openxmlformats.org/officeDocument/2006/math'] = 'm'
    

    This should get you past the error you're getting; however, there's a little more to understand.

    The CT_OMathPara class is an example of what's known as a custom element class. This means that lxml instantiates an object of this class for each element with the registered tag (m:oMathPara) instead of the generic lxml _Element class.

    The key thing is, you need to let lxml do the constructing, which happens when it parses the XML. You can't get a meaningful object by constructing that class yourself.
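
    The difference between constructing the class yourself and letting a parser construct it can be reproduced with plain lxml. The sketch below does not use python-docx at all; MathPara, lookup, and parser are illustrative names, and the parser here only stands in for the one the oxml module configures. It relies on lxml falling back to the class name as the tag when an ElementBase subclass is instantiated directly without a TAG attribute.

```python
from lxml import etree

M_URI = "http://schemas.openxmlformats.org/officeDocument/2006/math"


class MathPara(etree.ElementBase):
    """Illustrative custom element class (stand-in for CT_OMathPara)."""


# A parser that looks up element classes by namespace and local tag name.
lookup = etree.ElementNamespaceClassLookup()
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)
lookup.get_namespace(M_URI)["oMathPara"] = MathPara

# Constructed directly: lxml knows nothing about the intended tag,
# so the class name ends up as the tag.
direct = MathPara()
print(direct.tag)

# Constructed by the parser: the tag comes from the XML and the class from the lookup.
parsed = etree.fromstring('<m:oMathPara xmlns:m="%s"/>' % M_URI, parser)
print(parsed.tag, isinstance(parsed, MathPara))

# makeelement() on the configured parser also goes through the lookup.
made = parser.makeelement("{%s}oMathPara" % M_URI)
print(made.tag, isinstance(made, MathPara))
```

    Only the last two elements carry a namespaced tag, which is why an element built by calling the class directly has nothing that can be mapped back to a prefix.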

    The easiest way to create a new "loose" element (not lodged in an XML document tree) is to use docx.oxml.OxmlElement():

    oMathPara = OxmlElement('m:oMathPara')
    

    More commonly, though, the docx.oxml.parse_xml() function is used to parse an entire XML snippet. A parser has to be configured to use custom element classes, and those classes have to be registered with it, so you probably don't want to set that up yourself when the parser in the oxml module already takes care of it.

    So usually, to get an instance of CT_OMathPara, you would just open a docx file that contains an m:oMathPara element (after registering the new namespace and custom element class), but you can also parse an XML snippet directly. If you search for parse_xml in the oxml modules you'll find plenty of examples. You need to get the namespace declarations right on the root element of the XML you provide, which can be a little tricky. You can certainly spell out the entire XML snippet as text if you want; it just gets a bit verbose.