c++performance montecarlo mersenne-twister

On the random number generation dSFMT performance on C++

I'm trying to find the most efficient way to generate random numbers for a MC simulation I'm working on. I've been reading a lot about the double precision Mersenne Twister algorithm and I wanted to understand a couple of basic things before moving on.

I compiled and run the test provided by the official dSFMT files and this the best result for my system:

C:\TDM-GCC-64\C++ Tests\dSFMT>test-sse2-M19937 -s
consumed time for generating 100000000 randoms.
ST BLOCK [0, 1) AVE: 115ms.
ST BLOCK (0, 1] AVE: 108ms.
ST BLOCK (0, 1) AVE: 106ms.
ST BLOCK [1, 2) AVE:  77ms.
ST SEQ [0, 1) 1 AVE: 174ms.
ST SEQ [0, 1) 2 AVE: 207ms.
total = 500014655.815776
ST SEQ (0, 1] 1 AVE: 173ms.
ST SEQ (0, 1] 2 AVE: 205ms.
total = 500035344.184224
ST SEQ (0, 1) 1 AVE: 209ms.
ST SEQ (0, 1) 2 AVE: 247ms.
total = 500014655.815776
ST SEQ [1, 2) 1 AVE: 173ms.
ST SEQ [1, 2) 2 AVE: 204ms.
total = 1500064655.815183

My questions are:

Why is generating [1,2) faster than [0,1)?
Why block generation is faster than sequential? Shouldn't allocating a big array and having to delete and rewrite on it be slower?
If I need to generate 1e12 numbers, what would be the best strategy? If doing it in blocks, what's the optimal array size?

Solution

Numbers inside library are generated from [1,2) interval. Other ranges are expressed as a functions on top of this interval.

"Basic" interval [1,2) generator:

inline static double dsfmt_genrand_close1_open2(dsfmt_t *dsfmt) {
    double r;
    double *psfmt64 = &dsfmt->status[0].d[0];

    if (dsfmt->idx >= DSFMT_N64) {
        dsfmt_gen_rand_all(dsfmt);
        dsfmt->idx = 0;
    }
    r = psfmt64[dsfmt->idx++];
    return r;
}

Interval [0, 1):

inline static double dsfmt_genrand_close_open(dsfmt_t *dsfmt) {
    return dsfmt_genrand_close1_open2(dsfmt) - 1.0;
}

Block generation can be faster for many reasons, including cache locality, less function calls, loop unrolling and so on. In practice, block operations are often faster than individual operations combined.

In this particular case, block generation is also faster because numbers are generated in pairs (W128_T type):

union W128_T {
    __m128i si;
    __m128d sd;
    uint64_t u[2];
    uint32_t u32[4];
    double d[2];
};

Block version makes use of this property, and copies both numbers from W128_T into result array. Sequential version only uses the first number and discards the second.

As for your third question, use block generation, as it proved to be faster on your computer. You have 1e8 numbers per 100ms, so for 1e12 you need about twenty minutes. If it's okay for you, then just use NUM_RANDS block size, there shouldn't be much difference for any reasonable block size. Otherwise, consider generating numbers from independent generators in multiple threads.