
Parsing an XML document (ODT file): how to step through elements to fill an array


I am trying to parse an XML document (the content.xml of an ODT file).

$reader = new XMLReader();
if (!$reader->open("content.xml")) {
    die("Failed to open 'content.xml'");
}
// Step through the text:h and text:p elements to put them into an array
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && ($reader->name === 'text:h' || $reader->name === 'text:p')) {
        echo $reader->expand()->textContent; // Put the text into an array in the correct order...
    }
}
$reader->close();

First of all, I just need a small hint on how to step through the elements of the XML file correctly. With my attempt I can step through the text:h elements, but how do I get the other elements (text:p) without messing everything up?

Nevertheless, I'll show you my final goal. Please don't think I'm asking for a complete solution; I wrote everything down only to show which structure I need. I want to solve this problem step by step.

The content of this XML file looks something like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
[...]
<office:body>
    <office:text text:use-soft-page-breaks="true">
        <text:h text:style-name="P1" text:outline-level="2">Chapter 1</text:h>
            <text:p text:style-name="Standard">Lorem ipsum. </text:p>

            <text:h text:style-name="Heading3" text:outline-level="3">Subtitle 1</text:h>
                <text:p text:style-name="Standard"><text:span text:style-name="T2">Something 1:</text:span> Lorem.</text:p>
                <text:p text:style-name="Standard"><text:span text:style-name="T3">Something 2:</text:span><text:s/>Lorem ipsum.</text:p>
                <text:p text:style-name="Standard"><text:span text:style-name="T4">Something 3:</text:span> Lorem ipsum.</text:p>

            <text:h text:style-name="Heading3" text:outline-level="3">Subtitle 2</text:h>
                <text:p text:style-name="Standard"><text:span text:style-name="T5">10</text:span><text:span text:style-name="T6">:</text:span><text:s/>Text (100%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                <text:p text:style-name="Standard">9.7:<text:s/>Text (97%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                <text:p text:style-name="Standard"><text:span text:style-name="T9">9.1:</text:span><text:s/>Text (91%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                    <text:p text:style-name="Explanation">More furter informations.</text:p>

            [Subtitle 3 and 4]

            <text:h text:style-name="Heading3" text:outline-level="3">Subtitle 5</text:h>
                <text:p text:style-name="Standard"><text:span text:style-name="T5">10</text:span><text:span text:style-name="T6">:</text:span><text:s/>Text (100%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                <text:p text:style-name="Standard">9.7:<text:s/>Text (97%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                <text:p text:style-name="Standard"><text:span text:style-name="T9">9.1:</text:span><text:s/>Text (91%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                    <text:p text:style-name="Explanation">More furter informations.</text:p>

            <text:h text:style-name="Heading3" text:outline-level="3">References</text:h>
                <text:list text:style-name="LFO44" text:continue-numbering="true">
                    <text:list-item><text:p text:style-name="P25">blabla et al., Any Title p. 580-586</text:p></text:list-item>
                    <text:list-item><text:p text:style-name="P25">blabla et al., Any Title p. 580-586</text:p></text:list-item>
                    <text:list-item><text:p text:style-name="P25">blabla et al., Any Title p. 580-586</text:p></text:list-item>
                    <text:list-item><text:p text:style-name="P25">blabla et al., Any Title p. 580-586</text:p></text:list-item>
                </text:list>

        [Multiple Chapter like this]

    </office:text>
</office:body>

As you can see, the "subchapters" always have standard elements, each with optional explanation elements (several explanation elements for one standard element are possible). This structure is always the same.

My final goal is to split up all the information to get an array output like this:

array() {
  [1]=>
  array() {
    ["chapter"]=>
    string() "Chapter 1"
    ["content"]=>
    array() {
      [0]=>
      array() {
        ["subchapter"]=>
        string() "Description"
        ["content"]=>
        array() {
          [0]=>
          array() {
            ["standard"]=>
            string() "Lorem ipsum."
            ["explanation"]=>
            string(0) ""
          }
        }
      }
      [1]=>
      array() {
        ["subchapter"]=>
        string() "Subtitle 1"
        ["content"]=>
        array() {
          [0]=>
          array() {
            ["standard"]=>
            string() "Something 1: Lorem."
            ["explanation"]=>
            string() ""
          }
          [1]=>
          array() {
            ["standard"]=>
            string() "Something 2: Lorem ipsum."
            ["explanation"]=>
            string() ""
          }
          [2]=>
          array() {
            ["standard"]=>
            string() "Something 3: Lorem ipsum."
            ["explanation"]=>
            string() ""
          }          
        }
      }
      [2]=>
      array() {
        ["subchapter"]=>
        string() "Subtitle 2"
        ["content"]=>
        array() {
          [0]=>
          array() {
            ["standard"]=>
            string() "10: Text (100%)"
            ["explanation"]=>
            string() "Further informations."
          }
    [and so on] 

Solution

  • Edit:

    I can see your issue now; thanks for editing the question.

    Inside your while loop

    while ($reader->read()){ 
    
    }
    

    you have a couple of properties and methods available to get the nodes and their values:

    $reader->value
    

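    gives the value of the current node (e.g. 'Subtitle 1' while the reader is on that text node; on an element node such as text:h it is an empty string, so use $reader->readString() there)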

    $reader->getAttribute('text:style-name')
    

    should return the 'Heading3' part.
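
A minimal sketch of those accessors on a made-up two-element sample. The sample string and the $rows array are only my illustration, not taken from your file; the namespace URIs are the standard ODF ones.

```php
<?php
// Hypothetical sample; a real content.xml declares the same namespaces on its root.
$xml = <<<XML
<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
             xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:h text:style-name="Heading3" text:outline-level="3">Subtitle 1</text:h>
  <text:p text:style-name="Standard"><text:span text:style-name="T2">Something 1:</text:span> Lorem.</text:p>
</office:text>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$rows = [];
while ($reader->read()) {
    // Only look at opening tags of headings and paragraphs.
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }
    if ($reader->name !== 'text:h' && $reader->name !== 'text:p') {
        continue;
    }
    $rows[] = [
        'name'  => $reader->name,
        'style' => $reader->getAttribute('text:style-name'),
        'value' => $reader->value,        // empty string on element nodes
        'text'  => $reader->readString(), // text of the element and its descendants
    ];
}
$reader->close();

print_r($rows);
```

The point of the comparison is that value stays empty on the element itself, while readString() (or expand()->textContent, as in your question) collects the text, including the text inside nested text:span elements.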

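    Putting it all together, you probably want something like this [pseudocode; note that the index is set once, before the loop]:

     // set an index once, before the loop
     $i = 0;
     $array = [];
     while ($reader->read()) {
         // only opening tags carry the attributes we need
         if ($reader->nodeType !== XMLReader::ELEMENT) {
             continue;
         }
         // get the parts from the XML that we need
         $name = $reader->name;
         $attrib = $reader->getAttribute('text:style-name');
         if ($attrib === null) {
             continue; // element without a style name
         }
         $value = $reader->readString();

         // if the style name is 'P1', increment our index, as we need a new entry in our array
         if ($attrib === 'P1') {
             $i++;
         }

         $array[$i][$attrib] = $value;
     }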
    

    Note that this only handles one level of nesting. It looks like you need four levels, so you should probably keep four indexes [$i, $j, $k, $l] and check each element against whatever starts a new level (P1, Heading3, etc.).

    You might end up with

    $array[$i][$j][$k] = $reader->value;
    

    or the like. Remember to reset all your sub-indexes when you increment a higher index (e.g. if you do $i++, set $j = 0, $k = 0, etc.).
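
Here is a rough sketch of that index bookkeeping, under assumptions of mine that you should check against your real file: I key chapters on text:outline-level="2" instead of the style name P1, treat every other heading as a subchapter, and take the style name Explanation to mark an explanation. The function name parseOdtContent and the sample string are hypothetical, the 'Description' label is borrowed from your expected output, indexes are zero-based, and text:s elements are not turned into spaces.

```php
<?php
// Sketch only: groups text:h / text:p elements into chapter > subchapter > entry.
function parseOdtContent(XMLReader $reader): array
{
    $chapters = [];
    $c = $s = $e = -1; // indexes of the current chapter, subchapter and entry

    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        if ($reader->name === 'text:h') {
            $title = trim($reader->readString());
            if ($reader->getAttribute('text:outline-level') === '2') {
                // New chapter: reset the lower-level indexes.
                $chapters[++$c] = ['chapter' => $title, 'content' => []];
                $s = $e = -1;
            } elseif ($c >= 0) {
                // New subchapter: reset the entry index.
                $chapters[$c]['content'][++$s] = ['subchapter' => $title, 'content' => []];
                $e = -1;
            }
        } elseif ($reader->name === 'text:p' && $c >= 0) {
            $text = trim($reader->readString());
            if ($s < 0) {
                // Paragraph directly below a chapter heading.
                $chapters[$c]['content'][++$s] = ['subchapter' => 'Description', 'content' => []];
            }
            $entries = &$chapters[$c]['content'][$s]['content'];
            if ($reader->getAttribute('text:style-name') === 'Explanation' && $e >= 0) {
                // Append to the explanation of the most recent standard entry.
                $entries[$e]['explanation'] = ltrim($entries[$e]['explanation'] . ' ' . $text);
            } else {
                $entries[++$e] = ['standard' => $text, 'explanation' => ''];
            }
            unset($entries); // drop the reference before the next iteration
        }
    }
    return $chapters;
}

// Hypothetical sample modelled on the question's XML.
$sample = <<<XML
<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
             xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:h text:style-name="P1" text:outline-level="2">Chapter 1</text:h>
  <text:p text:style-name="Standard">Lorem ipsum. </text:p>
  <text:h text:style-name="Heading3" text:outline-level="3">Subtitle 2</text:h>
  <text:p text:style-name="Standard">9.1: Text (91%)</text:p>
  <text:p text:style-name="Explanation">Further informations.</text:p>
  <text:p text:style-name="Explanation">More further informations.</text:p>
</office:text>
XML;

$reader = new XMLReader();
$reader->XML($sample);
$result = parseOdtContent($reader);
$reader->close();

print_r($result);
```

Several explanations for one standard entry are joined with a space here; if you would rather keep them apart, make 'explanation' an array and push onto it instead.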

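    Previous answers below:

    SimpleXML could (probably) do this in a few lines [if the structure of the XML file is already nested the correct way, which, after a quick look, it appears to be]: http://php.net/manual/en/book.simplexml.php. Be aware that, as far as I know, json_encode() only sees elements without a namespace prefix, and every element in content.xml is prefixed, so the resulting array may come back nearly empty.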

    $xml = simplexml_load_file('content.xml');
    $json = json_encode($xml);
    $array = json_decode($json,TRUE);
    
    print_r($array);
    

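    Edit: you can also use XPath with SimpleXML. Because the elements are namespaced, plain property access will not find them; you have to go through children() with the prefix, for example:

    echo $xml->children('office', true)->body->children('office', true)->text->children('text', true)->h;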
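
For the XPath route, a small sketch on a made-up sample. The sample string and the variable names are mine; the namespace URI is the standard ODF text namespace, registered explicitly so the query does not depend on what the document declares at the context node.

```php
<?php
// Hypothetical sample standing in for content.xml.
$source = <<<XML
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
                         xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body>
    <office:text>
      <text:h text:outline-level="2">Chapter 1</text:h>
      <text:h text:outline-level="3">Subtitle 1</text:h>
      <text:p text:style-name="Standard"><text:span text:style-name="T2">Something 1:</text:span> Lorem.</text:p>
    </office:text>
  </office:body>
</office:document-content>
XML;

$xml = simplexml_load_string($source);
$xml->registerXPathNamespace('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

// Headings contain plain text only, so a string cast is enough.
$headings = [];
foreach ($xml->xpath('//text:h') as $h) {
    $headings[] = (string) $h;
}

// Paragraphs may contain nested spans; textContent collects all descendant text.
$paragraphs = [];
foreach ($xml->xpath('//text:p') as $p) {
    $paragraphs[] = dom_import_simplexml($p)->textContent;
}

print_r($headings);
print_r($paragraphs);
```

Keep in mind that SimpleXML loads the whole document into memory, whereas XMLReader streams it, so for a large content.xml the XMLReader loop may still be the better fit.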