Get specific tag with XMLReader in PHP


I have the following XML structure in my XML file (this is only part of the file):

<?xml version="1.0" encoding="utf-8"?>
    <extensions>
        <extension extensionkey="fp_product_features">
            <downloadcounter>355</downloadcounter>
            <version version="0.1.0">
                <title>Product features</title>
                <description/>
                <downloadcounter>24</downloadcounter>
                <state>beta</state>
                <reviewstate>0</reviewstate>
                <category>plugin</category>
                <lastuploaddate>1142878270</lastuploaddate>
                <uploadcomment> added related features</uploadcomment>
            </version>
        </extension>
    </extensions>

The file is too big for SimpleXML, so I'm using XMLReader. I have a switch that checks for the XML-tags and their content:

while ($xmlReader->read()) {
    if ($xmlReader->nodeType == XMLReader::ELEMENT) {
        switch ($xmlReader->name) {
            case "title":
                $xmlReader->read();
                $foo = $xmlReader->value;
                // Do stuff with the value
                break;

            case "description":
                $xmlReader->read();
                $bar = $xmlReader->value;
                // Do stuff with the value
                break;

            case "downloadcounter":
                $xmlReader->read();
                $foobar = $xmlReader->value;
                // Do stuff with the value
                break;

            case "state":
                $xmlReader->read();
                $barfoo = $xmlReader->value;
                // Do stuff with the value
                break;

            // Repeat for other tags
        }
    }
}

The problem here is that there are two <downloadcounter> tags: one beneath <extension> and one beneath <version>. I need the one beneath <version>, but the code in my switch is giving me the one beneath <extension>. All the other cases are giving me the right information.

I have thought about some solutions. Maybe there is a way to specify that XMLReader only reads the tag after <description>? I've tried calling $xmlReader->read() multiple times in one case, but that didn't help. I'm very new to this, so maybe this is not the right way to do it, but if anyone can point me in the right direction, it would be much appreciated.

Thanks in advance!


Solution

  • Ok, some notes on this...

    The file is too big for SimpleXML, so I'm using XMLReader.

    That would mean that loading the XML file with SimpleXML hits PHP's memory_limit, right? An alternative would be to read the XML file in chunks and process each part on its own. Each chunk has to be well-formed XML by itself, otherwise simplexml_load_string() fails.

    $xml_chunk = (.... read file chunked ...)
    $xml = simplexml_load_string($xml_chunk);
    $json = json_encode($xml);
    $array = json_decode($json,TRUE);
    

    But working with XMLReader is fine!
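
    One way to obtain such chunks is to let XMLReader find each <extension> and hand only that subtree to SimpleXML. The following is a minimal sketch under assumptions, not a tested solution for the real file: the inline sample stands in for the big file, and the $counters array and its key format are made up for illustration.

```php
<?php
// Sketch only: the inline sample stands in for the large file. With a real
// file, $reader->open($path) would replace $reader->XML($xml); $path is a
// placeholder, not a name from the question.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<extensions>
    <extension extensionkey="fp_product_features">
        <downloadcounter>355</downloadcounter>
        <version version="0.1.0">
            <title>Product features</title>
            <downloadcounter>24</downloadcounter>
        </version>
    </extension>
</extensions>
XML;

$reader = new XMLReader();
$reader->XML($xml);

// Advance to the first <extension> element.
while ($reader->read() && $reader->name !== 'extension') {
    // nothing to do, just keep reading
}

$counters = [];
while ($reader->name === 'extension') {
    // Only this one subtree is turned into a SimpleXML object.
    $extension = simplexml_load_string($reader->readOuterXml());

    foreach ($extension->version as $version) {
        // Hypothetical key format: "<extensionkey> <version>".
        $key = (string) $extension['extensionkey'] . ' ' . (string) $version['version'];
        // $version->downloadcounter is the counter beneath <version>.
        $counters[$key] = (string) $version->downloadcounter;
    }

    // Skip the rest of this subtree and jump to the next <extension>.
    $reader->next('extension');
}
$reader->close();

print_r($counters);
```

    Memory use then depends on the size of the largest single <extension>, not on the size of the whole file.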

    Maybe there is a way where I can specify that XMLReader only reads the tag after <description>?

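    Yes, there is. As "i alarmed alien" pointed out, if you work with DOMDocument you can use an XPath query to reach the exact node you want. Keep in mind that DOMDocument, like SimpleXML, loads the whole document into memory, so this only helps if the file fits within memory_limit.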

    $dom = new DomDocument();
    $dom->load("tooBig.xml");
    $xp = new DomXPath($dom);
    
    $result = $xp->query("/extensions/extension/version/downloadcounter");
    
    print $result->item(0)->nodeValue ."\n";
    

    For more examples see the PHP manual: http://php.net/manual/de/domxpath.query.php
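
    If the whole file does not fit into memory, the XPath idea can still be applied per <extension>: XMLReader::expand() copies just the current subtree into a DOM node. This is a rough sketch under the same assumptions as the question's sample (the inline XML replaces the real file, and $counters is a name chosen here for illustration).

```php
<?php
// Sketch only: the inline sample stands in for the large file. With a real
// file, $reader->open($path) would replace $reader->XML($xml).
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<extensions>
    <extension extensionkey="fp_product_features">
        <downloadcounter>355</downloadcounter>
        <version version="0.1.0">
            <title>Product features</title>
            <downloadcounter>24</downloadcounter>
        </version>
    </extension>
</extensions>
XML;

$reader = new XMLReader();
$reader->XML($xml);

// Advance to the first <extension> element.
while ($reader->read() && $reader->name !== 'extension') {
    // nothing to do, just keep reading
}

$counters = [];
while ($reader->name === 'extension') {
    // Build a small DOM document that contains only the current <extension>.
    $dom = new DOMDocument();
    $dom->appendChild($dom->importNode($reader->expand(), true));

    // The path is now rooted at <extension>, not at <extensions>.
    $xp = new DOMXPath($dom);
    foreach ($xp->query('/extension/version/downloadcounter') as $node) {
        $counters[] = $node->nodeValue;
    }

    // Skip the rest of this subtree and jump to the next <extension>.
    $reader->next('extension');
}
$reader->close();

print_r($counters);
```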


    If you want to stick to XMLReader:

    The XMLReader extension is an XML pull parser. The reader moves forward through the document stream, stopping at each node on the way. This explains why you get the first <downloadcounter>, the one beneath the <extension> tag, but not the one beneath <version>: the switch only looks at the element name, and both elements share it. Context-dependent handling is awkward, because looking ahead is not really possible without re-reading.

    DEMO http://ideone.com/Oykfyh

    <?php
    
    $xml = <<<'XML'
    <?xml version="1.0" encoding="utf-8"?>
        <extensions>
            <extension extensionkey="fp_product_features">
                <downloadcounter>355</downloadcounter>
                <version version="0.1.0">
                    <title>Product features</title>
                    <description/>
                    <downloadcounter>24</downloadcounter>
                    <state>beta</state>
                    <reviewstate>0</reviewstate>
                    <category>plugin</category>
                    <lastuploaddate>1142878270</lastuploaddate>
                    <uploadcomment> added related features</uploadcomment>
                </version>
            </extension>
        </extensions>
    XML;
    
    $reader = new XMLReader();
    // Use the documented data:// form of the stream wrapper.
    $reader->open('data://text/plain,'.urlencode($xml));
    
    $result = [];
    $element = null;
    
    while ($reader->read()) {
    
      if($reader->nodeType === XMLReader::ELEMENT) 
      {
        $element = $reader->name;
    
        if($element === 'extensions') {
            $result['extensions'] = array();
        }
    
        if($element === 'extension') {
            $result['extensions']['extension'] = array();
        }
    
        if($element === 'downloadcounter') {
            // Only the counter beneath <extension> reaches this branch; the one
            // beneath <version> is consumed by the inner loop further down.
            // isset() avoids an "undefined index" notice before <version> is seen.
            if(!isset($result['extensions']['extension']['version'])) {
                $result['extensions']['extension']['downloadcounter'] = '';
            }
        }
    
        if($element === 'version') {
            $result['extensions']['extension']['version'] = array();
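            while ($reader->read()) {
               // Stop at </version>, otherwise this loop swallows the rest of the document.
               if($reader->nodeType === XMLReader::END_ELEMENT && $reader->name === 'version')
               {
                   break;
               }
               if($reader->nodeType === XMLReader::ELEMENT) 
               {
                   $element = $reader->name;
                   $result['extensions']['extension']['version'][$element] = '';
               }
               if($reader->nodeType === XMLReader::TEXT) 
               {
                   $value = $reader->value;
                   $result['extensions']['extension']['version'][$element] = $value;
               }
            }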
        }
      }
    
      if($reader->nodeType === XMLReader::TEXT) 
      {
        $value = $reader->value;
    
        if($element === 'downloadcounter') {
            // isset() avoids an "undefined index" notice before <version> is seen.
            if(!isset($result['extensions']['extension']['version'])) {
                $result['extensions']['extension']['downloadcounter'] = $value;
            } else {
                $result['extensions']['extension']['version']['downloadcounter'] = $value;
            }
        }
      }
    }
    $reader->close();
    
    echo var_export($result, true);
    

    Result:

    array (
      'extensions' => 
      array (
        'extension' => 
        array (
          'downloadcounter' => '355',
          'version' => 
          array (
            'title' => 'Product features',
            'description' => '',
            'downloadcounter' => '24',
            'state' => 'beta',
            'reviewstate' => '0',
            'category' => 'plugin',
            'lastuploaddate' => '1142878270',
            'uploadcomment' => ' added related features',
          ),
        ),
      ),
    )
    

    This transforms your XML into an array (with nested arrays). It's not perfect, because it iterates more than necessary. Feel free to hack away...
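
    A leaner variation, as a rough sketch rather than a drop-in replacement: keep a stack of the currently open element names and only accept text whose parent element is <version>. The names $path, $versions and $current are invented for this example, and CDATA sections are not handled.

```php
<?php
// Sketch only: the inline sample stands in for the large file. With a real
// file, $reader->open($path) would replace $reader->XML($xml).
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<extensions>
    <extension extensionkey="fp_product_features">
        <downloadcounter>355</downloadcounter>
        <version version="0.1.0">
            <title>Product features</title>
            <description/>
            <downloadcounter>24</downloadcounter>
            <state>beta</state>
        </version>
    </extension>
</extensions>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$path = [];      // names of the currently open elements, outermost first
$versions = [];  // one array of fields per <version>
$current = -1;   // index of the <version> being filled

while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            $path[] = $reader->name;
            if ($reader->name === 'version') {
                $versions[++$current] = [];
            }
            // Empty elements such as <description/> produce no END_ELEMENT,
            // so they have to be popped right away.
            if ($reader->isEmptyElement) {
                array_pop($path);
            }
            break;

        case XMLReader::END_ELEMENT:
            array_pop($path);
            break;

        case XMLReader::TEXT:
            $depth = count($path);
            // Accept the text only if the enclosing element sits directly
            // beneath <version>; this skips extension/downloadcounter.
            if ($depth >= 2 && $path[$depth - 2] === 'version') {
                $versions[$current][$path[$depth - 1]] = $reader->value;
            }
            break;
    }
}
$reader->close();

print_r($versions);
```

    Because the parent name is checked explicitly, the 355 beneath <extension> is ignored and only the 24 beneath <version> ends up in the result.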

    Additionally:
    - Parsing Huge XML Files in PHP
    - https://github.com/prewk/XmlStreamer