Tags: php, apc, large-data

PHP and the million array baby


Imagine you have the following array of integers:

array(1, 2, 1, 0, 0, 1, 2, 4, 3, 2, [...] );

The array runs to one million entries; only instead of being hardcoded, they've been pre-generated and stored in a JSON-formatted file (approximately 2MB in size). The order of these integers matters: I can't randomly generate it every time, because it must be consistent and always have the same values at the same indexes.

If this file is later read back in PHP (e.g. using file_get_contents + json_decode), it takes 700 to 900ms just to get the array back. "Okay," I thought, "it's probably reasonable since json_decode has to parse about 2 million characters; let's cache it." APC caches it in an entry that takes about 68MB, which is probably normal since zvals are large. Retrieving this array back from APC, however, also takes a good 600ms, which is in my eyes still way too much.

Edit: APC serializes/unserializes content to store and retrieve it, which with a one-million-item array is a lengthy and heavy process.
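
For reference, here is a minimal sketch of how such timings can be measured (the JSON file name is illustrative; apc_store/apc_fetch are the plain APC functions mentioned above):

    <?php
    // Rough timing sketch: decode the JSON file, then cache and re-fetch via APC.
    $t = microtime(true);
    $array = json_decode(file_get_contents('ints.json'), true);
    printf("json_decode: %.3fs\n", microtime(true) - $t);   // ~0.7-0.9s in my case
    
    apc_store('ints', $array);                               // ~68MB cache entry
    
    $t = microtime(true);
    $array = apc_fetch('ints');
    printf("apc_fetch:   %.3fs\n", microtime(true) - $t);    // ~0.6s in my case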

So the questions:

  • Should I expect this latency, no matter the data store or the method, if I intend to load a one-million-entry array in PHP? As far as I understand, APC stores the zval itself, so theoretically retrieving it from APC should be as fast as it can possibly get (no parsing, no conversion, no disk access).

  • Why is APC so slow for something so seemingly simple?

  • Is there any efficient way to load a one-million-entry array entirely in memory using PHP, assuming RAM usage is not a problem?

  • If I were to access only slices of this array based on indexes (e.g. loading the chunk from index 15 to index 76) and never actually have the entire array in memory (yes, I understand this is the sane way of doing it, but I wanted to know all the sides), what would be the most efficient data store for the complete array? Obviously not an RDBMS; I'm thinking Redis, but I would be happy to hear other ideas (a rough sketch of what that could look like follows this list).
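
For context, a rough sketch (not from the original question) of what slice access could look like with Redis via the phpredis extension, assuming the values are dumped one byte per integer into a single string:

    <?php
    // Hypothetical sketch: store everything as one Redis string and read a
    // slice with GETRANGE instead of loading the whole array into PHP.
    $packed = file_get_contents('ints.raw');   // raw one-byte-per-value dump (illustrative name)
    
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);
    $redis->set('ints', $packed);
    
    // Fetch indexes 15..76 only.
    $slice = $redis->getRange('ints', 15, 76);
    for ($i = 0; $i < strlen($slice); ++$i)
      echo ord($slice[$i]), "\n";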


Solution

  • Say the integers are all 0-15. Then you can store 2 per byte:

    <?php
    // Build a 500,000-byte string; each byte packs two 4-bit values (0-15),
    // so the string represents 1,000,000 integers.
    $data = '';
    for ($i = 0; $i < 500000; ++$i)
      $data .= chr(mt_rand(0, 255));
    
    echo serialize($data);
    

    To run: php ints.php > ints.ser

    Now you have a file containing a serialized 500,000-byte string that packs 1,000,000 random integers from 0 to 15.

    To load:

    <?php
    $data = unserialize(file_get_contents('ints.ser'));
    
    function get_data_at($data, $i)
    {
      // $i >> 1 selects the byte; even indexes use the high nibble,
      // odd indexes the low nibble.
      $data = ord($data[$i >> 1]);
    
      return ($i & 1) ? $data & 0xf : $data >> 4;
    }
    
    for ($i = 0; $i < 1000; ++$i)
      echo get_data_at($data, $i), "\n";
    

    The loading time on my machine is about .002 seconds.

    Of course this might not be directly applicable to your situation, but it will be much faster than a bloated PHP array of a million entries. Quite frankly, having an array that large in PHP is never the proper solution.

    I'm not saying this is the proper solution either, but it definitely is workable if it fits your parameters.

    Note that if your array had integers in the 0-255 range, you could get rid of the packing and just access the data as ord($data[$i]). In that case, your string would be 1M bytes long.

    Finally, according to the documentation of file_get_contents(), PHP will memory-map the file if the OS supports it. If so, your best performance would be to dump raw bytes to a file and use it like:

    $ints = file_get_contents('ints.raw');
    echo ord($ints[25]);
    

    This assumes that ints.raw is exactly one million bytes long.
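
    If you need to produce ints.raw from your existing JSON data rather than from random values, a one-time conversion along these lines should do it (assuming every value fits in 0-255; the JSON file name is illustrative):

    <?php
    // One-time conversion: read the existing JSON array and write one raw
    // byte per integer (values assumed to fit in 0-255).
    $values = json_decode(file_get_contents('ints.json'), true);
    
    $raw = '';
    foreach ($values as $v)
      $raw .= chr($v);
    
    file_put_contents('ints.raw', $raw);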