Search code examples
phpmemory-efficient

Memory-efficient representation of dates in php


I am computing a large list of date pairs (start and end) and am running into memory issues. My current representation is string based and still inefficient:

Currently I serialize the two dates as manual-YYYY-MM-DD-YYYY-MM-DD which is 28 bytes. I could get this down to 16 by switching to YYYYMMDDYYYYMMDD.

However if I am changing the format now, I was wondering if there's an even more efficient solution. In theory there are not too many dates, so a 4 byte int should suffice, bringing me down to 8 byte.

I initially decided for a string representation because it's (a) human readable and (b) easily deduplicatable (think of an array of such periods and in_array)

To give an idea of the numbers: I am setting up an array of period arrays, so a list of 813.000 arrays that contains altogether 65 million such items (many many duplicates, most of the time it's month start-to-end like 2021-05-01-2021-05-31). So I am also interested if some lookup dictionary soluotion might help...


Solution

  • To save memory, you could store a date as an int representing the number of days since a reference date, such as the earliest date, if you know what that is in advance, or maybe a date definitely earlier than the entire data set otherwise.

    An int occupies the size of a zend_value, 8 bytes in PHP >= 7, since it's stored directly as part of the zend_value itself (see zend_value, which is part of _zval_struct, aka zval) and thus has no memory overhead at all.

    It's the lowest possible memory footprint a value can have, since anything other than an int (lval in zend_value, a zend_long) or double (dval in zend_value, a double) is stored as a pointer to a value allocated separately (*str, *arr, *obj, etc... in zend_value), which takes up both the size of the pointer, which is equal to the size of the zend_value, and the size of the structure it's pointing to.

    For example, a string (zend_string) takes: 8 bytes (the size of zend_refcounted_h for gc) + 8 bytes (the size of zend_ulong for h) + 8 bytes (the size of size_t for len) + 1 byte (the size of char for val) = 25 bytes of overhead + 8 bytes (the size of the zend_value) = 33 bytes for an empty string + one more byte for each character (or multiple for multibyte strings).

    You could fit 4 ints in the memory occupied by an empty string.

    If you want to calculate exactly how much memory a PHP value takes, keep in mind that you need to account for the overhead added by the zval containing the zend_value, which can vary depending on PHP version.

    Since you have date ranges, you could also consider representing a range as an int (something like $first_day * 1000000 + $last_day), thus saving the overhead of having two zvals to represent two ints.


    A lookup table should help too, especially if you have a lot of similar dates.

    You could setup an array of dates and an index array. When you encounter a date, see if index[date] exists and, if so, then the value is the index in the date array, if not, then insert it in the date array and store the index. Something like:

    if (!isset($index[$date])) {
        $dates []= $date;
        $index[$date] = count($dates) - 1;
    }
    $date_index = $index[$date];
    

    Now you can store $date_indexes instead of the full dates, and can reference $index to retrieve dates when needed.


    The dates don't have to be human readable in memory, just when you display them.

    So you can write a display function that does all that's necessary to display the data in human readable form: obtain the date from $dates using $date_index, then convert it from its internal representation to a readable string.


    I also encourage you to consider El_Vanja's comment. This sounds like a long running task, not something the user needs to see as a response to a request, so maybe you could process the data in smaller batches instead of processing it all at once.

    If the input doesn't change, and therefore the result of the computation doesn't change either, then also consider caching the result so you don't have to do all of this again. In this situation, you could also consider doing the computation and caching the result as soon as a number of data entries become available, which should use a lot less processing power and have results available sooner as well.