Tags: php, xml, loops, mysqli, simplexml

Large PHP for loop with SimpleXMLElement very slow: memory issues?


I currently have a bit of PHP code that pulls data in from an XML file and creates a SimpleXML object with $products = new SimpleXMLElement($xmlString);. I then loop over this object with a for loop, setting the product details for each product in the XML document and saving each one to a MySQL database.

While the script runs, products are added at a decreasing rate until they eventually stop, well before the end of the file is reached. I have tried running garbage collection at intervals, as well as unsetting various variables, to no avail.

Part of the code is shown below:

<?php
$servername = "localhost";
$username = "database.database";
$password = "demwke";
$database = "databasename";
$conn = new mysqli($servername, $username, $password, $database);

$file = "large.xml";
$xmlString = file_get_contents($file);
$products = new SimpleXMLElement($xmlString);
unset($xmlString, $file);
$total = count($products->datafeed[0]);

echo 'Starting<br><br>';

for($i=0;$i<$total;$i++){
    $id = $products->datafeed->prod[$i]['id'];
etc etc
    $sql = "INSERT INTO products (id, name, uid, cat, prodName, brand, desc, link, imgurl, price, subcat) VALUES ('$id', '$store', '$storeuid', '$category', '$prodName', '$brand', '$prodDesc', '$link', '$image', '$price', '$subCategory')";
}
echo '<br>Finished';
?>

The PHP variables are all defined with a line similar to the one for $id, but have been removed here for easier reading.

Any ideas on what I can do or read to get this to complete? The time taken doesn't really matter to me, as long as it eventually finishes.


Solution

  • Update: never use indexes with SimpleXML unless you have very few objects; use foreach instead:

    // Before, with [index]:
    for ($i=0;$i<$total;$i++) {
        $id = $products->datafeed->prod[$i]['id'];
        ...
    }

    // After, with foreach():
    $i = 0;
    foreach ($products->datafeed->prod as $prod) {
        $i++; // Remove if you don't actually need $i
        $id = $prod['id'];
        ...
    }


    In general, ...->node[$i] will access the node[] array and read through it all the way up to the desired index, so iterating over the node array is not O(N) but O(N²). There is no workaround, because there is no guarantee that when you access item K, you have just accessed item K-1 (and so on, recursively). foreach saves the pointer and therefore works in O(N).

    For the same reason, it might be advantageous to iterate over the whole array with foreach even if you only need a few known items (unless they are few and very near the beginning of the array):

    // Before, with [index]:
    $a[0]  = $products->datafeed->prod[15]['id'];
    ...
    $a[35] = $products->datafeed->prod[1293]['id'];
    
    // After, with foreach():
    $want = [ 15, ... 1293 ];
    $i = 0;
    foreach ($products->datafeed->prod as $prod) {
        if (!in_array(++$i, $want)) {
            continue;
        }
        $a[] = $prod['id'];
    }
    

    You should first verify whether the increasing delay is caused by MySQLi or by the XML processing. Remove (comment out) the SQL query execution, and nothing else, from the loop, and check whether the speed (granted, it will now be much higher... :-) ) now remains constant or shows the same decrease.
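
    For instance, a minimal way to instrument this (just a sketch: the timing code is mine, everything else mirrors the question's loop):

    $tick = microtime(true);
    for ($i = 0; $i < $total; $i++) {
        $id = $products->datafeed->prod[$i]['id'];
        // ... build $sql exactly as before ...
        // $conn->query($sql);   // commented out, so only the XML access is timed

        if ($i > 0 && 0 === ($i % 1000)) {
            // How long did the last 1000 iterations take?
            printf("%s  i = %d  last 1000 took %.3fs\n", date('H:i:s'), $i, microtime(true) - $tick);
            $tick = microtime(true);
        }
    }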

    I suspect that the XML processing is the culprit, here:

    for($i=0;$i<$total;$i++){
        $id = $products->datafeed->prod[$i]['id'];
    

    ...where you access an index that is farther and farther into a SimpleXMLElement. This might suffer from the Schlemiel the Painter problem.

    The straight answer to your question, "how do I get the loop to complete, no matter the time", is "increase memory limit and max execution time".
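
    In PHP that typically means something along these lines (the specific values are illustrative placeholders, not recommendations from the original answer; a CLI run already has no time limit by default):

    // Raise the limits for this script only; tune the values to your data size.
    ini_set('memory_limit', '1024M');    // or '-1' for no limit (use with care)
    ini_set('max_execution_time', '0');  // '0' means no execution time limit
    set_time_limit(0);                   // runtime equivalent of the line above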

    To improve performance, you can use a different way of iterating over the feed object:

    $i = -1;
    foreach ($products->datafeed->prod as $prod) {
        $i++;
        $id = $prod['id'];
        ...
    }
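
    Applied to the loop in the question, that might look roughly like the sketch below. The column list comes from the question; the prepared statement is my own addition (not part of the original code) to sidestep quoting problems, and `desc` is backquoted because it is a reserved word in MySQL:

    // Sketch only: fields other than $id are read from $prod in the same way and elided here.
    $stmt = $conn->prepare(
        "INSERT INTO products
            (id, name, uid, cat, prodName, brand, `desc`, link, imgurl, price, subcat)
         VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
    );

    foreach ($products->datafeed->prod as $prod) {
        $id = (string) $prod['id'];
        // ... $store, $storeuid, $category, $prodName, $brand, $prodDesc,
        //     $link, $image, $price, $subCategory read from $prod as before ...

        $stmt->bind_param('sssssssssss',
            $id, $store, $storeuid, $category, $prodName, $brand,
            $prodDesc, $link, $image, $price, $subCategory);
        $stmt->execute();
    }
    $stmt->close();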
    

    Experimenting

    I use this small program to read a large XML document and iterate over its content:

    // Stage 1. Create a large XML.
    $xmlString = '<?xml version="1.0" encoding="UTF-8" ?>';
    $xmlString .= '<content><package>';
    for ($i = 0; $i < 100000; $i++) {
        $xmlString .=  "<entry><id>{$i}</id><text>The quick brown fox did what you would expect</text></entry>";
    }
    $xmlString .= '</package></content>';
    
    // Stage 2. Load the XML.
    $xml    = new SimpleXMLElement($xmlString);
    
    $tick   = microtime(true);
    for ($i = 0; $i < 100000; $i++) {
        $id = $xml->package->entry[$i]->id;
        if (0 === ($id % 5000)) {
            $t = microtime(true) - $tick;
            print date("H:i:s") . " id = {$id} at {$t}\n";
            $tick = microtime(true);
        }
    }
    

    After generating the XML, a loop parses it and prints how long it takes to iterate over 5000 elements. To verify that this really is the time delta, the current time is also printed; the delta should be approximately the difference between successive timestamps.

    21:22:35 id = 0 at 2.7894973754883E-5
    21:22:35 id = 5000 at 0.38135695457458
    21:22:38 id = 10000 at 2.9452259540558
    21:22:44 id = 15000 at 5.7002019882202
    21:22:52 id = 20000 at 8.0867099761963
    21:23:02 id = 25000 at 10.477082967758
    21:23:15 id = 30000 at 12.81209897995
    21:23:30 id = 35000 at 15.120756149292
    

    So that's what happens: processing the XML array goes slower and slower.

    This is mostly the same program using foreach:

    // Stage 1. Create a large XML.
    $xmlString = '<?xml version="1.0" encoding="UTF-8" ?>';
    $xmlString .= '<content><package>';
    for ($i = 0; $i < 100000; $i++) {
        $xmlString .=  "<entry><id>{$i}</id><text>The quick brown fox did ENTRY {$i}.</text></entry>";
    }
    $xmlString .= '</package></content>';
    
    // Stage 2. Load the XML.
    $xml    = new SimpleXMLElement($xmlString);
    
    $i      = 0;
    $tick   = microtime(true);
    foreach ($xml->package->entry as $data) {
        // $id = $xml->package->entry[$i]->id;
        $id = $data->id;
        $i++;
        if (0 === ($id % 5000)) {
            $t = microtime(true) - $tick;
            print date("H:i:s") . " id = {$id} at {$t} ({$data->text})\n";
            $tick = microtime(true);
        }
    }
    

    The times now seem to be constant... I say "seem" because they appear to have decreased by a factor of about ten thousand, and I have some difficulty getting reliable measurements at this scale; a more precise clock is sketched after the output below.

    (And no, I had no idea. I probably never used indexes with large XML arrays).

    21:33:42 id = 0 at 3.0994415283203E-5 (The quick brown fox did ENTRY 0.)
    21:33:42 id = 5000 at 0.0065329074859619 (The quick brown fox did ENTRY 5000.)
    ...
    21:33:42 id = 95000 at 0.0065121650695801 (The quick brown fox did ENTRY 95000.)
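
    As a side note on getting steadier measurements: PHP 7.3+ offers hrtime(), a monotonic nanosecond clock that is more reliable than microtime() for very short intervals. This is my addition, not something the original experiment used; a sketch:

    $tick = hrtime(true);                        // monotonic clock, in nanoseconds
    foreach ($xml->package->entry as $data) {
        $id = (int) $data->id;
        if (0 === ($id % 5000)) {
            $ms = (hrtime(true) - $tick) / 1e6;  // elapsed milliseconds
            printf("id = %d, last 5000 entries took %.3f ms\n", $id, $ms);
            $tick = hrtime(true);
        }
    }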