Tags: php, json, xml, openxml, docx

Converting DOCX / Word-generated XML to JSON


I am trying to convert a Word-generated XML file to JSON through PHP.

From what I have found, the usual approach recommended for any XML file (including in the PHP documentation) is the following code:

$xml = simplexml_load_string($xml_string);
$json = json_encode($xml);
$array = json_decode($json,TRUE);

The problem is that after simplexml_load_string I get an empty SimpleXMLElement object, so the remaining steps cannot proceed. The XML itself begins as follows:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:wordDocument 

and the tags have a prefix of w:. I have tried removing the w: prefixes, but the function still returns an empty object. Any idea what I might be missing? Is there anything special about this type of generated XML?


Solution

  • @ThW is correct: Don't convert OOXML to JSON. It won't help.

    The complexity of OOXML (the standard behind DOCX) will not be tamed by conversion to JSON. A successful JSON conversion would be challenging, and its main benefit would be a better appreciation of the general advice to use XML for documents and JSON for data.

    See also "JSON or XML? Which is better?" and note:

    • OOXML is an existing, highly complex standard for documents, not data.
    • Existing OOXML tool infrastructure is 100% XML-based.
    • Representing documents requires representation of mixed content – something JSON is not designed to do. [1]

    [1] Somewhat ironically, mixed content is rarely used in OOXML: Runs of text are generally wrapped within w:r/w:t elements. If you're looking for inspiration that a JSON-based DOCX representation would be possible, this is it. If you're looking to understand how JSON wouldn't tame the DOCX complexity, this should also help. :-)
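
If the goal is only to pull some data (say, paragraph text) out of the document, one option in the spirit of the advice above is to query the XML directly and JSON-encode just the extracted values. The following is a minimal sketch, not a definitive implementation. The sample document is hypothetical and far smaller than anything Word produces, and its namespace URI is an assumption based on the w:wordDocument root shown in the question. The code reads the URI from the document itself rather than relying on a hard-coded value. It also hints at why the object in the question looked empty: SimpleXML does not list namespaced elements such as w:body unless the namespace is requested explicitly.

```php
<?php
// Hypothetical, heavily simplified sample. The namespace URI below is an
// assumption; real files declare their own, which is why it is looked up
// from the document further down instead of being hard-coded in the query.
$xml_string = <<<'XML'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>
XML;

$xml = simplexml_load_string($xml_string);
if ($xml === false) {
    throw new RuntimeException('Could not parse the XML');
}

// Map of prefix => namespace URI for every namespace used in the document.
$namespaces = $xml->getNamespaces(true);
if (!isset($namespaces['w'])) {
    throw new RuntimeException('No namespace bound to the w: prefix');
}
$xml->registerXPathNamespace('w', $namespaces['w']);

// Collect the text of each paragraph by joining its w:t runs.
$paragraphs = [];
foreach ($xml->xpath('//w:p') as $p) {
    // Registrations are per object, so repeat it for each paragraph element.
    $p->registerXPathNamespace('w', $namespaces['w']);
    $text = '';
    foreach ($p->xpath('.//w:t') as $t) {
        $text .= (string) $t;
    }
    $paragraphs[] = $text;
}

// Encode only the extracted data, not the document structure.
$json = json_encode($paragraphs);
echo $json, PHP_EOL; // ["Hello, world","Second paragraph"]
```

This keeps the document itself in XML, where the existing tooling lives, and uses JSON only for the plain data extracted from it. Formatting, tables, and everything else in the file are deliberately ignored here.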