Tags: c++, arrays, performance, optimization, memory-bandwidth

Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?


Is there any possible optimization for random access on a very big array (I currently use uint8_t, and I'm asking about what's better)

uint8_t MyArray[10000000];

when the value at any position in the array is

  • 0 or 1 for 95% of all cases,
  • 2 in 4% of cases,
  • between 3 and 255 in the other 1% of cases?

So, is there anything better than a uint8_t array to use for this? Looping over the whole array in a random order should be as quick as possible, and it is very heavy on RAM bandwidth: with more than a few threads doing this at the same time on different arrays, the available RAM bandwidth quickly saturates.

I'm asking because it feels very inefficient to have such a big array (10 MB) when it's known that almost all values, apart from 5%, will be either 0 or 1. Since 95% of the values would only need 1 bit instead of 8 bits, packing them would reduce memory usage by almost an order of magnitude. It feels like there has to be a more memory-efficient solution that would greatly reduce the RAM bandwidth required, and as a result also be significantly quicker for random access.


Solution

  • A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, and a separate, sorted array of 4 bytes per value (24 bits for the original element index, 8 bits for the actual value, i.e. (idx << 8) | value) for the other ones.

    When you look up a value, you first do a lookup in the 2bpp array (O(1)); if you find 0, 1 or 2, that's the value you want; if you find 3, it means you have to look it up in the secondary array. There you perform a binary search for the index of interest left-shifted by 8 (O(log(n)), with a small n, as this should be the 1%), and extract the value from the 4-byte entry.

    std::vector<uint8_t> main_arr;
    std::vector<uint32_t> sec_arr;
    
    uint8_t lookup(unsigned idx) {
        // extract the 2 bits of our interest from the main array
        uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
        // usual (likely) case: value between 0 and 2
        if(v != 3) return v;
        // bad case: lookup the index<<8 in the secondary array
        // lower_bound finds the first >=, so we don't need to mask out the value
        auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
    #ifdef _DEBUG
        // some coherency checks
        if(ptr == sec_arr.end()) std::abort();
        if((*ptr >> 8) != idx) std::abort();
    #endif
        // extract our 8-bit value from the 32 bit (index, value) thingie
        return (*ptr) & 0xff;
    }
    
    void populate(uint8_t *source, size_t size) {
        main_arr.clear(); sec_arr.clear();
        // size the main storage (round up)
        main_arr.resize((size+3)/4);
        for(size_t idx = 0; idx < size; ++idx) {
            uint8_t in = source[idx];
            uint8_t &target = main_arr[idx>>2];
            // if the input doesn't fit, cap to 3 and put in secondary storage
            if(in >= 3) {
                // top 24 bits: index; low 8 bits: value
                sec_arr.push_back((idx << 8) | in);
                in = 3;
            }
            // store in the target according to the position
            target |= in << ((idx & 3)*2);
        }
    }
    

    For an array such as the one you proposed, this should take 10000000 / 4 = 2500000 bytes for the first array, plus 10000000 * 1% * 4 B = 400000 bytes for the second array; hence 2900000 bytes, i.e. less than one third of the original array. The most used portion is kept together in memory, which should be good for caching (it may even fit in L3).

    If you need more than 24-bit addressing, you'll have to tweak the "secondary storage"; a trivial way to extend it is to have a 256-element pointer array to switch over the top 8 bits of the index and forward to a 24-bit-indexed sorted array as above (a sketch follows).
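
    A minimal sketch of that extension, assuming 32-bit indices and using an array of sorted vectors in place of the pointer array; the names sec_buckets and lookup_wide are mine, not part of the code above. populate would then push ((idx & 0xffffff) << 8) | in into sec_buckets[idx >> 24] instead of sec_arr.

    // 256 buckets, selected by the top 8 bits of the index; each bucket is a
    // sorted vector of ((low 24 bits of index) << 8) | value, as before
    std::vector<uint32_t> sec_buckets[256];
    
    uint8_t lookup_wide(uint32_t idx) {
        // same 2-bit fast path as before
        uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
        if(v != 3) return v;
        // select the bucket with the top 8 bits, then binary-search the low 24 bits
        std::vector<uint32_t> &bucket = sec_buckets[idx >> 24];
        auto ptr = std::lower_bound(bucket.begin(), bucket.end(), (idx & 0xffffffu) << 8);
        return (*ptr) & 0xff;
    }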


    Quick benchmark

    #include <algorithm>
    #include <vector>
    #include <stdint.h>
    #include <chrono>
    #include <stdio.h>
    #include <math.h>
    
    using namespace std::chrono;
    
    /// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
    /// than an LCG, but fails some test suites
    struct XorShift32 {
        /// This allows the class to be used wherever a library function
        /// requires a UniformRandomBitGenerator (e.g. std::shuffle)
        typedef uint32_t result_type;
        static uint32_t min() { return 1; }
        static uint32_t max() { return uint32_t(-1); }
    
        /// PRNG state
        uint32_t y;
    
        /// Initializes with seed
        XorShift32(uint32_t seed = 0) : y(seed) {
            if(y == 0) y = 2463534242UL;
        }
    
        /// Returns a value in the range [1, 1<<32)
        uint32_t operator()() {
            y ^= (y<<13);
            y ^= (y>>17);
            y ^= (y<<15);
            return y;
        }
    
        /// Returns a value in the range [0, limit); this conforms to the RandomFunc
        /// requirements for std::random_shuffle
        uint32_t operator()(uint32_t limit) {
            return (*this)()%limit;
        }
    };
    
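    /// Running mean/variance accumulator (Welford's online algorithm)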
    struct mean_variance {
        double rmean = 0.;
        double rvariance = 0.;
        int count = 0;
    
        void operator()(double x) {
            ++count;
            double ormean = rmean;
            rmean     += (x-rmean)/count;
            rvariance += (x-ormean)*(x-rmean);
        }
    
        double mean()     const { return rmean; }
        double variance() const { return rvariance/(count-1); }
        double stddev()   const { return std::sqrt(variance()); }
    };
    
    std::vector<uint8_t> main_arr;
    std::vector<uint32_t> sec_arr;
    
    uint8_t lookup(unsigned idx) {
        // extract the 2 bits of our interest from the main array
        uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
        // usual (likely) case: value between 0 and 2
        if(v != 3) return v;
        // bad case: lookup the index<<8 in the secondary array
        // lower_bound finds the first >=, so we don't need to mask out the value
        auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
    #ifdef _DEBUG
        // some coherency checks
        if(ptr == sec_arr.end()) std::abort();
        if((*ptr >> 8) != idx) std::abort();
    #endif
        // extract our 8-bit value from the 32 bit (index, value) thingie
        return (*ptr) & 0xff;
    }
    
    void populate(uint8_t *source, size_t size) {
        main_arr.clear(); sec_arr.clear();
        // size the main storage (round up)
        main_arr.resize((size+3)/4);
        for(size_t idx = 0; idx < size; ++idx) {
            uint8_t in = source[idx];
            uint8_t &target = main_arr[idx>>2];
            // if the input doesn't fit, cap to 3 and put in secondary storage
            if(in >= 3) {
                // top 24 bits: index; low 8 bits: value
                sec_arr.push_back((idx << 8) | in);
                in = 3;
            }
            // store in the target according to the position
            target |= in << ((idx & 3)*2);
        }
    }
    
    volatile unsigned out;
    
    int main() {
        XorShift32 xs;
        std::vector<uint8_t> vec;
        int size = 10000000;
        for(int i = 0; i<size; ++i) {
            uint32_t v = xs();
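            // thresholds below are cumulative fractions of 2^32: 42.5% -> 0, 95% -> 0 or 1,
            // 99% -> 0, 1 or 2, matching the distribution in the question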
            if(v < 1825361101)      v = 0; // 42.5%
            else if(v < 4080218931) v = 1; // 95.0%
            else if(v < 4252017623) v = 2; // 99.0%
            else {
                while((v & 0xff) < 3) v = xs();
            }
            vec.push_back(v);
        }
        populate(vec.data(), vec.size());
        mean_variance lk_t, arr_t;
        for(int i = 0; i<50; ++i) {
            {
                unsigned o = 0;
                auto beg = high_resolution_clock::now();
                for(int i = 0; i < size; ++i) {
                    o += lookup(xs() % size);
                }
                out += o;
                int dur = (high_resolution_clock::now()-beg)/microseconds(1);
                fprintf(stderr, "lookup: %10d µs\n", dur);
                lk_t(dur);
            }
            {
                unsigned o = 0;
                auto beg = high_resolution_clock::now();
                for(int i = 0; i < size; ++i) {
                    o += vec[xs() % size];
                }
                out += o;
                int dur = (high_resolution_clock::now()-beg)/microseconds(1);
                fprintf(stderr, "array:  %10d µs\n", dur);
                arr_t(dur);
            }
        }
    
        fprintf(stderr, " lookup |   ±  |  array  |   ±  | speedup\n");
        printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
                lk_t.mean(), lk_t.stddev(),
                arr_t.mean(), arr_t.stddev(),
                arr_t.mean()/lk_t.mean());
        return 0;
    }
    

    (code and data always updated in my Bitbucket)

    The code above populates a 10M element array with random data distributed as OP specified in their post, initializes my data structure and then:

    • performs a random lookup of 10M elements with my data structure
    • does the same through the original array.

    (notice that in the case of a sequential lookup the array always wins by a huge margin, as it's the most cache-friendly access pattern you can get)

    These last two blocks are repeated 50 times and timed; at the end, the mean and standard deviation for each type of lookup are calculated and printed, along with the speedup (array_mean/lookup_mean).

    I compiled the code above with g++ 5.4.0 (-O3 -static, plus some warning flags) on Ubuntu 16.04, and ran it on several machines; most of them are running Ubuntu 16.04, some an older Linux, some a newer one. I don't think the OS should be relevant at all in this case.

                CPU           |  cache   |  lookup (µs)   |     array (µs)  | speedup (x)
    Xeon E5-1650 v3 @ 3.50GHz | 15360 KB |  60011 ±  3667 |   29313 ±  2137 | 0.49
    Xeon E5-2697 v3 @ 2.60GHz | 35840 KB |  66571 ±  7477 |   33197 ±  3619 | 0.50
    Celeron G1610T  @ 2.30GHz |  2048 KB | 172090 ±   629 |  162328 ±   326 | 0.94
    Core i3-3220T   @ 2.80GHz |  3072 KB | 111025 ±  5507 |  114415 ±  2528 | 1.03
    Core i5-7200U   @ 2.50GHz |  3072 KB |  92447 ±  1494 |   95249 ±  1134 | 1.03
    Xeon X3430      @ 2.40GHz |  8192 KB | 111303 ±   936 |  127647 ±  1503 | 1.15
    Core i7 920     @ 2.67GHz |  8192 KB | 123161 ± 35113 |  156068 ± 45355 | 1.27
    Xeon X5650      @ 2.67GHz | 12288 KB | 106015 ±  5364 |  140335 ±  6739 | 1.32
    Core i7 870     @ 2.93GHz |  8192 KB |  77986 ±   429 |  106040 ±  1043 | 1.36
    Core i7-6700    @ 3.40GHz |  8192 KB |  47854 ±   573 |   66893 ±  1367 | 1.40
    Core i3-4150    @ 3.50GHz |  3072 KB |  76162 ±   983 |  113265 ±   239 | 1.49
    Xeon X5650      @ 2.67GHz | 12288 KB | 101384 ±   796 |  152720 ±  2440 | 1.51
    Core i7-3770T   @ 2.50GHz |  8192 KB |  69551 ±  1961 |  128929 ±  2631 | 1.85
    

    The results are... mixed!

    1. In general, on most of these machines there is some kind of speedup, or at least they are on a par.
    2. The two cases where the array truly trumps the "smart structure" lookup are on machines with lots of cache that are not particularly busy: the Xeon E5-1650 above (15 MB cache) is a nightly build machine, quite idle at the moment; the Xeon E5-2697 (35 MB cache) is a machine for high-performance computations, also idle at the time of the test. It makes sense: the original array fits completely in their huge cache, so the compact data structure only adds complexity.
    3. At the opposite end of the "performance spectrum", but again with the array slightly faster, there's the humble Celeron that powers my NAS; it has so little cache that neither the array nor the "smart structure" fits in it at all. Other machines with similarly small caches perform similarly.
    4. The Xeon X5650 results must be taken with some caution: they are virtual machines on a quite busy dual-socket virtualization server; it may well be that, although it nominally has a decent amount of cache, the test got preempted several times by completely unrelated virtual machines.