PHP replace text in Office OpenXML files with XMLWriter/XMLReader


I'm using XMLReader to find text in an Office OpenXML document and XMLWriter to write it to an xliff file. I then modify the text in that xliff file, and now I want to rebuild the OpenXML document. I am using the XML iterator class as suggested in this question.

I want to replace each node's content in the original file with the corresponding node's content from the xliff file, matching the position of the node against the id attribute. So the 10th <w:t> node will be replaced with the content of <g id="10">, if it exists.

What happens with my current code is that the tag contents are not replaced. Instead it generates empty self-closing tags and places the original content after them. Right after the first such tag, it also closes the document.

xliff file - segments.xliff

<?xml version="1.0"?>
<xliff>
 <file original="/home/brgwe507/public_html/previas/wp-content/uploads/sites/9/2015/03/Cap32.docx" datatype="x-noveritis" source-language="pt-BR">
  <body>
   <trans-unit id="177">
    <source><g id="217">In a thermodynamic process, energy is transferred to or from a system by two primary methods.</g></source><seg-source><mrk mtype="seg" id="1"><g id="217">In a thermodynamic process, energy is transferred to or from a system by two primary methods.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="1"><g id="217">tradução segmento1.</g></mrk> </target>
   </trans-unit>
   <trans-unit id="178">
    <source><g id="217">The first method to be considered is work and the second, which will follow in Section 3.2, is heat transfer.</g></source><seg-source><mrk mtype="seg" id="2"><g id="217">The first method to be considered is work and the second, which will follow in Section 3.2, is heat transfer.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="2"><g id="217">tradução segmento 2</g></mrk> </target>
   </trans-unit>
   <trans-unit id="179">
    <source><g id="218">Work, designated </g><g id="219">W</g><g id="220">, is defined in mechanics as the product of a force and the distance moved in the direction of the force.</g></source><seg-source><mrk mtype="seg" id="3"><g id="218">Work, designated </g><g id="219">W</g><g id="220">, is defined in mechanics as the product of a force and the distance moved in the direction of the force.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="3"><g id="218">tradução</g><g id="219">teste</g><g id="220">, segmento 3</g></mrk> </target>
   </trans-unit>
   <trans-unit id="180">
    <source><g id="220">A more general definition of work is used in thermodynamics:</g><g id="221">Work</g><g id="222">, an interaction between a system and its surroundings, is done by a system if the sole external effect on the surroundings could be the raising of a weight.</g></source><seg-source><mrk mtype="seg" id="4"><g id="220">A more general definition of work is used in thermodynamics:</g><g id="221">Work</g><g id="222">, an interaction between a system and its surroundings, is done by a system if the sole external effect on the surroundings could be the raising of a weight.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="4"><g id="220">tradução deste segmento:</g><g id="221">para</g><g id="222">teste de tradução segmento 4.</g></mrk> </target>
   </trans-unit>
   <trans-unit id="181">
    <source><g id="222">The magnitude of the work is the product of the weight and the distance it could be </g><g id="223">lifted.This</g><g id="224"> definition allows a battery to do work since the energy produced by the battery could be the lifting of a weight, as suggested in Fig.</g></source><seg-source><mrk mtype="seg" id="5"><g id="222">The magnitude of the work is the product of the weight and the distance it could be </g><g id="223">lifted.This</g><g id="224"> definition allows a battery to do work since the energy produced by the battery could be the lifting of a weight, as suggested in Fig.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="5"><g id="222">tradução para teste </g><g id="223">xliff.</g><g id="224"> semgneto 5 ladsfoienfoqeiwnf</g></mrk> </target>
   </trans-unit>
   <trans-unit id="182">
    <source><g id="224">3.2.Work has unit</g><g id="225">s of N </g><g id="226">[S]</g><g id="227"> </g><g id="228">m 5 J.</g></source><seg-source><mrk mtype="seg" id="6"><g id="224">3.2.Work has unit</g><g id="225">s of N </g><g id="226">[S]</g><g id="227"> </g><g id="228">m 5 J.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="6"><g id="224">3.2. teste</g><g id="225">1 de 7 </g><g id="226">[S]</g><g id="227"> </g><g id="228">segmento.</g></mrk> </target>
   </trans-unit>
   <trans-unit id="183">
    <source><g id="228">The work done per unit mass, or </g><g id="229">specific work</g><g id="230">, is</g></source><seg-source><mrk mtype="seg" id="7"><g id="228">The work done per unit mass, or </g><g id="229">specific work</g><g id="230">, is</g></mrk></seg-source>
    <target><mrk mtype="seg" id="7"><g id="228">Para tradução </g><g id="229">segmento</g><g id="230">, é</g></mrk> </target>
   </trans-unit>
  </body>
 </file>
</xliff>

original document.xml to be updated

<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
<w:body>
<w:p w:rsidR="000C0514" w:rsidRPr="004F10D0" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">
<w:pPr>
<w:rPr>
<w:b/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="004F10D0">
<w:rPr>
<w:b/>
</w:rPr>
<w:t>CHAPTER 3</w:t>
</w:r>
</w:p>
...
<w:p w:rsidR="000C0514" w:rsidRPr="009D4166" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">
<w:pPr>
<w:rPr>
<w:b/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="009D4166">
<w:rPr>
<w:b/>
</w:rPr>
<w:t>Figure 3.57</w:t>
</w:r>
</w:p>
<w:sectPr w:rsidR="000C0514" w:rsidRPr="009D4166" w:rsidSect="004F10D0">
<w:headerReference w:type="even" r:id="rId7"/>
<w:pgSz w:w="11905" w:h="16840"/>
<w:pgMar w:top="1417" w:right="1701" w:bottom="1417" w:left="1701" w:header="0" w:footer="1305" w:gutter="0"/>
<w:cols w:space="720"/>
</w:sectPr>
</w:body>
</w:document>

PHP Code

    $xmlInputFile  = 'document.xml';
    $xmlOutputFile = 'new_document.xml';
    $xmlxliff = 'segments.xliff';

    $reader = new XMLReader();
    $reader->open($xmlInputFile);

    $writer = new XMLWriter();
    $writer->openUri($xmlOutputFile);

    $iterator = new XMLWritingIteration($writer, $reader);

    $segmentos = new XMLReader();
    $segmentos->open($xmlxliff);

    $writer->startDocument();
    $t=0;
    foreach ($iterator as $node) {
        $isElement = $node->nodeType === XMLReader::ELEMENT;

        if ($isElement && $node->name === 'w:t') {
            // increase <w:t> counter and find the same g id in the xliff
            $t++;
            $writer->startElement($node->name);
            while ($segmentos->read()) {
                if ($segmentos->nodeType == XMLREADER::ELEMENT && $segmentos->name === 'g') {
                    $gid = $segmentos->getAttribute('id');
                    if ($gid === $t) {
                        $texto = $segmentos->readInnerXML();
                        $writer->text($texto);
                    }
                }
            }
            $writer->endElement();
        } else {
            // handle everything else
            $iterator->write();
        }
    }
    $writer->endDocument();

And the output in new_document.xml

<?xml version="1.0"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
 <w:body>
  <w:p w:rsidR="000C0514" w:rsidRPr="004F10D0" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">
   <w:pPr>
    <w:rPr>
     <w:b/>
    </w:rPr>
   </w:pPr>
   <w:r w:rsidRPr="004F10D0">
    <w:rPr>
    <w:b/> 
    </w:rPr>
     <w:t/><--self closing <w:t> tag
    CHAPTER 3 <-- original text was not replaced and now is outside the tag
    </w:r>
   </w:p>
  </w:body> <-- body closing tag after first paragraph
</w:document> <-- document closing tag
<w:p w:rsidR="000C0514" w:rsidRPr="004F10D0" w:rsidRDefault="000C0514" w:rsidP="004F10D0"/> <-- more content after document closing tag
<w:p w:rsidR="004F10D0" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">... 

Solution

  • First of all, there is indeed a small problem with the code. I have updated XMLReaderIterator to version 0.1.8, which also contains a small fix that is useful in your scenario.

    The general problem with the flow in your example is that you don't advance the reading iterator. The parts you meant to replace are therefore still read and written later on, which is why you see them at the end of the document. So it is not enough to write the replacement; you also need to skip, in the reading iterator, over the elements you want to replace:

    $writer->startElement($node->name);
    
    $node->next();
    $iterator->skipNextRead();
    
    $writer->text(sprintf("TEXT #%d", $textCount));
    $writer->endElement();
    

    After starting the element, $node->next(); skips all subnodes (children) of the current $node element. This is necessary so that they are not output later on.

    Then $iterator->skipNextRead() tells the foreach not to advance once more (next() has already done that, and XMLReader is forward-only). This method is new in XMLWritingIteration as of v0.1.8, so you need the update.
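
    To see the skipping behaviour on its own, here is a minimal sketch that uses only PHP's built-in XMLReader, without the XMLReaderIterator library. The sample XML string and the variable names $before and $after are made up for illustration. After next(), the reader should be on the sibling element, not on a child of the element it left:

```php
<?php
// Sketch: XMLReader::next() moves past the current element and its whole subtree.
$reader = new XMLReader();
$reader->XML('<r><t>old<b>x</b></t><u>keep</u></r>'); // hypothetical sample input

$reader->read();          // cursor is on <r>
$reader->read();          // cursor is on <t>
$before = $reader->name;

$reader->next();          // skips the text "old" and <b>x</b>, i.e. everything inside <t>
$after = $reader->name;

printf("%s -> %s\n", $before, $after);
$reader->close();
```

    Because the cursor has already moved on after next(), a surrounding loop that calls read() again would lose one node. That is the situation skipNextRead() is meant to prevent.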

    Whole example (using your example XMLs):

    require('xmlreader-iterators.php'); // require XMLReaderIterator library
    
    $xmlInputFile = 'data/worddocument.xml';
    $xmlXliffFile = 'data/segments.xliff';
    
    $reader = new XMLReader();
    $reader->open($xmlInputFile);
    
    $writer = new XMLWriter();
    $writer->openMemory();
    
    $iterator = new XMLWritingIteration($writer, $reader);
    
    $writer->startDocument();
    
    $textCount = 0;
    foreach ($iterator as $node) {
        $isElement = $node->nodeType === XMLReader::ELEMENT;
    
        if ($isElement && $node->name === 'w:t') {
            $textCount++;
    
            $writer->startElement($node->name);
    
            $node->next();
            $iterator->skipNextRead();
    
            $writer->text(sprintf("TEXT #%d", $textCount));
            $writer->endElement();
        } else {
            // handle everything else
            $iterator->write();
        }
    }
    
    $writer->endDocument();
    echo $writer->outputMemory(true);
    

    Output:

    <?xml version="1.0"?>
    <w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
        <w:body>
            <w:p w:rsidR="000C0514" w:rsidRPr="004F10D0" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">
                <w:pPr>
                    <w:rPr>
                        <w:b/>
                    </w:rPr>
                </w:pPr>
                <w:r w:rsidRPr="004F10D0">
                    <w:rPr>
                        <w:b/>
                    </w:rPr>
                    <w:t>TEXT #1</w:t>
                </w:r>
            </w:p>
            ...
            <w:p w:rsidR="000C0514" w:rsidRPr="009D4166" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">
                <w:pPr>
                    <w:rPr>
                        <w:b/>
                    </w:rPr>
                </w:pPr>
                <w:r w:rsidRPr="009D4166">
                    <w:rPr>
                        <w:b/>
                    </w:rPr>
                    <w:t>TEXT #2</w:t>
                </w:r>
            </w:p>
            <w:sectPr w:rsidR="000C0514" w:rsidRPr="009D4166" w:rsidSect="004F10D0">
                <w:headerReference w:type="even" r:id="rId7"/>
                <w:pgSz w:w="11905" w:h="16840"/>
                <w:pgMar w:top="1417" w:right="1701" w:bottom="1417" w:left="1701" w:header="0" w:footer="1305" w:gutter="0"/>
                <w:cols w:space="720"/>
            </w:sectPr>
        </w:body>
    </w:document>
    

    I think this is closer to the output you are trying to achieve. If the xliff file isn't that large, it is probably better to parse it with SimpleXMLElement or DOMDocument instead of XMLReader. Both support XPath, which should be very handy for looking up the IDs in it and gathering the matching content quickly.
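
    A minimal sketch of that XPath lookup, assuming DOMDocument and DOMXPath. The helper name loadTargetTexts() is hypothetical, and the sample is a shortened copy of segments.xliff without an XML namespace, as in the question (a real XLIFF file with a default namespace would need registerNamespace() and prefixed paths). Note that getAttribute() returns a string, so a strict comparison such as $gid === $t against an integer counter never matches. The sketch casts the id to an integer instead:

```php
<?php
// Shortened stand-in for segments.xliff from the question.
$xliffXml = <<<'XML'
<?xml version="1.0"?>
<xliff>
 <file>
  <body>
   <trans-unit id="177">
    <source><g id="217">In a thermodynamic process, energy is transferred to or from a system by two primary methods.</g></source>
    <target><mrk mtype="seg" id="1"><g id="217">tradução segmento1.</g></mrk> </target>
   </trans-unit>
   <trans-unit id="179">
    <source><g id="218">Work, designated </g><g id="219">W</g></source>
    <target><mrk mtype="seg" id="3"><g id="218">tradução</g><g id="219">teste</g></mrk> </target>
   </trans-unit>
  </body>
 </file>
</xliff>
XML;

// Hypothetical helper: map each <g id="..."> inside a <target> to its text.
function loadTargetTexts(string $xliffXml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xliffXml);
    $xpath = new DOMXPath($doc);

    $texts = [];
    foreach ($xpath->query('//target//g[@id]') as $g) {
        // Cast the string attribute so the key matches an integer counter.
        $texts[(int) $g->getAttribute('id')] = $g->textContent;
    }
    return $texts;
}

$texts = loadTargetTexts($xliffXml);

$t = 218; // stands in for the <w:t> counter from the loop
echo $texts[$t] ?? '(no translation)', "\n";
```

    Inside the foreach loop, $writer->text($texts[$textCount] ?? '') could then take the place of the placeholder text. In the full segments.xliff some g ids (217 and 220, for example) occur in more than one trans-unit. With this sketch the last occurrence wins, so the keying may need to include the trans-unit id as well.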