Search code examples
docx

Where to find the schema (.xsd file) for Microsoft docx format


Consider a user that needs a text of docx document without the headers and footers for processing in R.

If a file.docx is renamed as file.zip and the document document.xml is analyzed - it is a well formed XML document with the text.

Did Microsfot (or other developer) publish a schema for this document.xml subfile in the ZIP package of docx file?

The file looks like this:

    <?xml version="1.0" encoding="UTF-8" standalone="yes" ?> 
- <w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14">
- <w:body>
- <w:p w:rsidR="00F447D7" w:rsidRPr="00C63308" w:rsidRDefault="00F447D7">
- <w:pPr>

Solution

  • From wikipedia:

    The format was initially standardised by Ecma (as ECMA-376) and, in later versions, by ISO and IEC (as ISO/IEC 29500).

    You can find various versions of the XSD in the ECMA-376 downloads

    document.xml conforms to the WordprocessingML part of the schemas (look for wml.xsd).