php xml xml-parsing xmlreader

Using XMLReader to read and parse large XML files: empty values problem


I need to read XML files about 1 GB in size. My XML:

<products>
<product>
<categoryName>Kable i konwertery AV</categoryName>
<brandName>Belkin</brandName>
<productCode>AV10176bt1M-BLK</productCode>
<productId>5616488</productId>
<productFullName>Kabel Belkin Kabel HDMI Ultra HD High Speed 1m-AV10176bt1M-BLK</productFullName>
<productEan>0745883767465</productEan>
<productEuroPriceNetto>59.71</productEuroPriceNetto>
<productFrontendPriceNetto>258.54</productFrontendPriceNetto>
<productFastestSupplierQuantity>23</productFastestSupplierQuantity>
<deliveryEstimatedDays>2</deliveryEstimatedDays>
</product>
<product>
<categoryName>Telewizory</categoryName>
<brandName>Sony</brandName>
<productCode>KDL32WD757SAEP</productCode>
<productId>1005662</productId>
<productFullName>Telewizor Sony KDL-32WD757 SAEP</productFullName>
<productEan></productEan>
<productEuroPriceNetto>412.33</productEuroPriceNetto>
<productFrontendPriceNetto>1785.38</productFrontendPriceNetto>
<productFastestSupplierQuantity>11</productFastestSupplierQuantity>
<deliveryEstimatedDays>6</deliveryEstimatedDays>
</product>
<product>
<categoryName>Kuchnie i akcesoria</categoryName>
<brandName>Brimarex</brandName>
<productCode>1566287</productCode>
<productId>885156</productId>
<productFullName>Brimarex Drewniane owoce, Kiwi - 1566287</productFullName>
<productEan></productEan>
<productEuroPriceNetto>0.7</productEuroPriceNetto>
<productFrontendPriceNetto>3.05</productFrontendPriceNetto>
<productFastestSupplierQuantity>7</productFastestSupplierQuantity>
<deliveryEstimatedDays>3</deliveryEstimatedDays>
</product>
</products>

I use XMLReader:

$reader = new XMLReader();
$reader->open($url);
$count = 0;

while($reader->read()) {
    if($reader->nodeType == XMLReader::ELEMENT)
        $nodeName = $reader->name;

    if(($reader->nodeType == XMLReader::TEXT || $reader->nodeType == XMLReader::CDATA)) {

        if ($nodeName == 'categoryName') $categoryName = $reader->value;
        if ($nodeName == 'brandName') $brandName = $reader->value;
        if ($nodeName == 'productCode') $productCode = $reader->value;
        if ($nodeName == 'productId') $productId = $reader->value;
        if ($nodeName == 'productFullName') $productFullName = $reader->value;
        if ($nodeName == 'productEan') $productEan = $reader->value;
        if ($nodeName == 'productEuroPriceNetto') $productEuroPriceNetto = $reader->value;
        if ($nodeName == 'productFastestSupplierQuantity') $productFastestSupplierQuantity = $reader->value;
        if ($nodeName == 'deliveryEstimatedDays') $deliveryEstimatedDays = $reader->value;
    }

    if($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'product') {
        $count++;
    }
}
$reader->close();

Everything works fine except for one problem: when a value is missing, for example <productEan></productEan>, the output contains the value from the previous non-empty tag, and it stays that way until the next non-empty tag.

For example, if one product has <productEan>0745883767465</productEan> and the next two have an empty <productEan></productEan>, all three entries in the output array get the same value, 0745883767465.

What is the right way to solve this problem? Or maybe someone has a working solution?


Solution

  • Here's some code that will do what you want. The problem is that an empty element such as <productEan></productEan> produces no TEXT node, so your code never overwrites the value left over from the previous product. The code below saves the value when it encounters a TEXT or CDATA node, then stores it under the element's name when it reaches END_ELEMENT. At that point the saved value is reset to '', so an element with no content gets an empty string (this could be changed to null if you prefer). It also handles self-closing tags, for example <brandName />, with an isEmptyElement check when an ELEMENT node is found. It takes advantage of PHP's variable variables to avoid the long sequence of if ($nodeName == ...) that you have in your code, but also uses an array to store the values for each product, which I think is the better long-term solution for your problem.

    $reader = new XMLReader();
    $reader->xml($xml); // for a large file, use $reader->open($url) as in the question
    $count = 0;
    $this_value = '';
    $products = array();
    while($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                // deal with self-closing tags e.g. <productEan />
                if ($reader->isEmptyElement) {
                    ${$reader->name} = '';
                    $products[$count][$reader->name] = '';
                }
                break;
            case XMLReader::TEXT:
            case XMLReader::CDATA:
                // save the value for storage when we get to the end of the element
                $this_value = $reader->value;
                break;
            case XMLReader::END_ELEMENT:
                if ($reader->name == 'product') {
                    $count++;
                    print_r(array($categoryName, $brandName, $productCode, $productId, $productFullName, $productEan, $productEuroPriceNetto, $productFrontendPriceNetto, $productFastestSupplierQuantity, $deliveryEstimatedDays));
                }
                elseif ($reader->name != 'products') {
                    ${$reader->name} = $this_value;
                    $products[$count][$reader->name] = $this_value;
                    // set this_value to a blank string to allow for empty tags
                    $this_value = '';
                }
                break;
            case XMLReader::WHITESPACE:
            case XMLReader::SIGNIFICANT_WHITESPACE:
            default:
                // nothing to do
                break;
        }
    }
    $reader->close();
    print_r($products);
    

    I've omitted the output as it's quite long, but you can see the code in operation in this demo on 3v4l.org.
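
For files around 1 GB, collecting every product in one array may use more memory than you want. As a rough sketch of the same reset-on-END_ELEMENT idea, the loop can be wrapped in a generator that hands back one product at a time. The function name readProducts and the shortened three-product sample are my own assumptions for illustration; they are not part of the original answer or its demo.

```php
<?php
// Hypothetical helper (not from the original answer): yields one product
// array at a time, so the full list never has to be held in memory.
function readProducts(XMLReader $reader): Generator
{
    $product = [];
    $value = '';
    while ($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                // a new element starts: discard any text collected so far
                $value = '';
                // self-closing tag, e.g. <productEan />: no END_ELEMENT follows
                if ($reader->isEmptyElement
                    && !in_array($reader->name, ['product', 'products'], true)) {
                    $product[$reader->name] = '';
                }
                break;
            case XMLReader::TEXT:
            case XMLReader::CDATA:
                $value .= $reader->value;
                break;
            case XMLReader::END_ELEMENT:
                if ($reader->name === 'product') {
                    yield $product;
                    $product = [];
                } elseif ($reader->name !== 'products') {
                    // stays '' when the element had no TEXT/CDATA node
                    $product[$reader->name] = $value;
                    $value = '';
                }
                break;
        }
    }
}

// Shortened sample: one filled, one empty, one self-closing productEan.
$xml = '<products>'
     . '<product><productId>1</productId><productEan>0745883767465</productEan></product>'
     . '<product><productId>2</productId><productEan></productEan></product>'
     . '<product><productId>3</productId><productEan/></product>'
     . '</products>';

$reader = new XMLReader();
$reader->xml($xml); // for a real file: $reader->open($url)
$eans = [];
foreach (readProducts($reader) as $product) {
    $eans[] = $product['productEan'];
}
$reader->close();
print_r($eans);
```

Because each product array is rebuilt from scratch after the yield, a value from one product cannot leak into the next, and the caller can write each product to a database or file inside the foreach instead of accumulating them.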