
Sieve of Eratosthenes has huge 'overdraw' - is Sundaram's better after all?


The standard Sieve of Eratosthenes crosses out most composites multiple times; in fact the only ones that do not get marked more than once are those that are the product of exactly two primes. Naturally, the overdraw increases as the sieve gets bigger.

For an odd sieve (i.e. without the evens) the overdraw hits 100% for n = 3,509,227, with 1,503,868 composites and 1,503,868 crossings-out of already crossed-out numbers. For n = 2^32 the overdraw rises to 134.25% (overdraw 2,610,022,328 vs. pop count 1,944,203,427 = (2^32 / 2) - 203,280,221).
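For anyone who wants to reproduce such counts on a small scale, here is a throwaway sketch of an odds-only sieve with overdraw accounting. The names (SieveStats, odd_sieve_stats) are made up, and std::vector&lt;bool&gt; stands in for a real, tuned bitmap:

```cpp
#include <cstdint>
#include <vector>

// Bit k stands for the odd number 2*k+1; a set bit means 'composite'.
struct SieveStats { uint64_t composites, overdraw; };

SieveStats odd_sieve_stats (uint64_t n)
{
   uint64_t max_bit = (n - 1) / 2;           // highest index: 2 * max_bit + 1 <= n
   std::vector<bool> bm(max_bit + 1, false);
   SieveStats s = { 0, 0 };

   for (uint64_t i = 1; 2 * i * (i + 1) <= max_bit; ++i)
   {
      if (bm[i])  continue;                  // 2*i+1 is composite, skip it

      uint64_t p = 2 * i + 1;

      for (uint64_t k = (p * p) / 2; k <= max_bit; k += p)
      {
         if (bm[k])  ++s.overdraw;           // crossing out an already-dead bit
         else        bm[k] = true;
      }
   }

   for (uint64_t k = 1; k <= max_bit; ++k)
      if (bm[k])  ++s.composites;

   return s;
}
```

For n = 100, for example, this reports 25 odd composites and 3 redundant crossings-out (45, 63 and 75, each hit by two primes).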

The Sieve of Sundaram - with another explanation at maths.org - may be a bit smarter about that, if - and only if - the loop limits are computed intelligently. However, that's something which the sources I've seen seem to gloss over as 'optimisation', and it also seems that an unoptimised Sundaram gets beaten by an odds-only Eratosthenes every time.

The interesting thing is that both create exactly the same final bitmap, i.e. one where bit k corresponds to the number (2 * k + 1). So both algorithms must end up setting exactly the same bits; they just have different ways of going about it.
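For small ranges that claim is easy to check by brute force. The sketches below (illustrative names, plain std::vector&lt;bool&gt;) build both bitmaps independently:

```cpp
#include <cstdint>
#include <vector>

// Bit k stands for the odd number 2*k+1; a set bit means 'composite'.

std::vector<bool> sundaram_map (uint32_t max_bit)
{
   std::vector<bool> bm(max_bit + 1);

   for (uint64_t i = 1; 2 * i * (i + 1) <= max_bit; ++i)
      for (uint64_t j = i; i + j + 2 * i * j <= max_bit; ++j)
         bm[i + j + 2 * i * j] = true;       // 2*k+1 == (2*i+1) * (2*j+1)

   return bm;
}

std::vector<bool> odd_eratosthenes_map (uint32_t max_bit)
{
   std::vector<bool> bm(max_bit + 1);

   for (uint64_t i = 1; 2 * i * (i + 1) <= max_bit; ++i)
   {
      uint64_t p = 2 * i + 1;                // the odd number for bit i

      for (uint64_t k = (p * p) / 2; k <= max_bit; k += p)
         bm[k] = true;                       // 2*k+1 is a multiple of p
   }

   return bm;
}
```

The maps agree for every max_bit because Sundaram sets bit k = i + j + 2*i*j exactly when 2*k+1 = (2*i+1) * (2*j+1), i.e. when 2*k+1 is an odd composite - which is precisely the set Eratosthenes marks.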

Does someone have hands-on experience with a competitive, tuned Sundaram? Can it beat the old Greek?

I have slimmed the code for my small factor sieve (2^32, an odds-only Greek) and tuned the segment size to 256 KB, which is as optimal on the old Nehalem with its 256 KB L2 as on the newer CPUs (even though the latter are much more forgiving about bigger segments). But now I have hit a brick wall and the bloody sieve still takes 8.5 s to initialise. Loading the sieve from hard disk is not a very attractive option, and multi-threading is difficult to do in a portable manner (since libs like boost tend to put a monkey wrench into portability)...

Can Sundaram shave a few seconds from the startup time?

P.S.: the overdraw as such is not a problem and will be absorbed by the L2 cache. The point is that the standard Eratosthenes seems to do more than twice the necessary work, which indicates that there may be potential for doing less work, faster.


Solution

  • Since there weren't any takers for the 'Sundaram vs. Eratosthenes' problem, I sat down and analysed it. Result: classic Sundaram's has strictly higher overdraw than an odds-only Eratosthenes; if you apply an obvious, small optimisation then the overdraw is exactly the same - for reasons that shall become obvious. If you fix Sundaram's to avoid overdraw entirely then you get something like Pritchard's Sieve, which is massively more complicated.

    The exposition of Sundaram's Sieve in Lucky's Notes is probably the best by far; slightly rewritten to use a hypothetical (i.e. non-standard and not supplied here) type bitmap_t it looks somewhat like this. In order to measure overdraw the bitmap type needs an operation corresponding to the BTS (bit test and set) CPU instruction, which is available via the _bittestandset() intrinsic with Wintel compilers and with MinGW editions of gcc. The intrinsics are very bad for performance but very convenient for counting overdraw.

    Note: for sieving all primes up to N one would call the sieve with max_bit = N/2; if bit i of the resulting bitmap is set then the number (2 * i + 1) is composite. The function has '31' in its name because the index math breaks for bitmaps greater than 2^31; hence this code can only sieve numbers up to 2^32-1 (corresponding to max_bit <= 2^31-1).
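Since bitmap_t is non-standard and not supplied, here is one minimal way it might look - a plain std::vector&lt;bool&gt; wrapper, nothing like a tuned segmented bitmap, but enough to make the listings below run as-is:

```cpp
#include <cstdint>
#include <vector>

// Minimal stand-in for the hypothetical bitmap_t used in the listings.
// bts() mimics the BTS instruction: set bit k, return its previous value.
struct bitmap_t
{
   std::vector<bool> bits;

   explicit bitmap_t (uint64_t bit_count) : bits(bit_count) {}

   void set_all (bool value)        { bits.assign(bits.size(), value); }
   bool bt      (uint64_t k) const  { return bits[k]; }

   bool bts (uint64_t k)
   {
      bool prior = bits[k];
      bits[k] = true;
      return prior;
   }
};
```

A real implementation would pack the bits into words and use BTS (or the _bittestandset() intrinsic); this wrapper only pins down the semantics. Note that a sieve up to max_bit needs a bitmap constructed with max_bit + 1 bits.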

    uint64_t Sundaram31_a (bitmap_t &bm, uint32_t max_bit)
    {
       assert( max_bit <= UINT32_MAX / 2 );
    
       uint32_t m = max_bit;
       uint64_t overdraw = 0;
    
       bm.set_all(0);
    
       for (uint32_t i = 1; i < m / 2; ++i)
       {
          for (uint32_t j = i; j <= (m - i) / (2 * i + 1); ++j)
          {
             uint32_t k = i + j + 2 * i * j;
    
             overdraw += bm.bts(k);
          }
       }
    
       return overdraw;
    }
    

    Lucky's bound on j is exact but the one on i is very loose. Tightening it and losing the m alias that I had added to make the code look more like common expositions on the net, we get:
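To see where the tightened bound comes from: for a given i the smallest index crossed out is the j = i term, k = i + i + 2*i*i = 2*i*(i+1). Requiring that to stay &lt;= max_bit and solving the quadratic 2*i^2 + 2*i - max_bit &lt;= 0 for i yields the closed form. A quick brute-force cross-check (hypothetical helper names, not part of the sieve code):

```cpp
#include <cmath>
#include <cstdint>

// Closed form: i <= (sqrt(2 * max_bit + 1) - 1) / 2.
uint32_t i_max_closed_form (uint32_t max_bit)
{
   return uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
}

// Brute-force reference: the largest i whose first crossing (at
// k = 2 * i * (i + 1)) still fits into the bitmap.
uint32_t i_max_brute_force (uint32_t max_bit)
{
   uint32_t i = 0;

   while (2 * uint64_t(i + 1) * (i + 2) <= max_bit)  ++i;

   return i;
}
```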

    uint64_t Sundaram31_b (bitmap_t &bm, uint32_t max_bit)
    {
       uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
       uint64_t overdraw = 0;
    
       bm.set_all(0);
    
       for (uint32_t i = 1; i <= i_max; ++i)
       {
          for (uint32_t j = i; j <= (max_bit - i) / (2 * i + 1); ++j)
          {
             uint32_t k = i + j + 2 * i * j;
    
             overdraw += bm.bts(k);
          }
       }
    
       return overdraw;
    }
    

    The assert was dumped in order to reduce noise but it is actually still valid and necessary. Now it's time for a bit of strength reduction, turning multiplication into iterated addition:

    uint64_t Sundaram31_c (bitmap_t &bm, uint32_t max_bit)
    {
       uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
       uint64_t overdraw = 0;
    
       bm.set_all(0);
    
       for (uint32_t i = 1; i <= i_max; ++i)
       {
          uint32_t n = 2 * i + 1;
          uint32_t k = n * i + i;   // <= max_bit because that's how we computed i_max
          uint32_t j_max = (max_bit - i) / n;
    
          for (uint32_t j = i; j <= j_max; ++j, k += n)
          {
             overdraw += bm.bts(k);
          }
       }
    
       return overdraw;
    }
    

    Transforming the loop condition to use k allows us to lose j; things should be looking exceedingly familiar by now...

    uint64_t Sundaram31_d (bitmap_t &bm, uint32_t max_bit)
    {
       uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
       uint64_t overdraw = 0;
    
       bm.set_all(0);
    
       for (uint32_t i = 1; i <= i_max; ++i)
       {
          uint32_t n = 2 * i + 1;     
          uint32_t k = n * i + i;   // <= max_bit because that's how we computed i_max
    
          for ( ; k <= max_bit; k += n)
          {        
             overdraw += bm.bts(k);
          }
       }
    
       return overdraw;
    }
    

    With things looking the way they do, it's time to analyse whether a certain obvious small change is warranted by the math. The proof is left as an exercise for the reader...

    uint64_t Sundaram31_e (bitmap_t &bm, uint32_t max_bit)
    {
       uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
       uint64_t overdraw = 0;
    
       bm.set_all(0);
    
       for (uint32_t i = 1; i <= i_max; ++i)
       {
          if (bm.bt(i))  continue;
    
          uint32_t n = 2 * i + 1;     
          uint32_t k = n * i + i;   // <= max_bit because that's how we computed i_max
    
          for ( ; k <= max_bit; k += n)
          {        
             overdraw += bm.bts(k);
          }
       }
    
       return overdraw;
    }
    

    The only thing that still differs from classic odds-only Eratosthenes (apart from the name) is the initial value for k, which is normally (n * n) / 2 for the old Greek. However, substituting 2 * i + 1 for n, the difference turns out to be 1/2, which integer division truncates to 0. Hence, Sundaram's is odds-only Eratosthenes without the 'optimisation' of skipping composites to avoid at least some crossings-out of already crossed-out numbers. The value for i_max is the same as the Greek's max_factor_bit, just arrived at using completely different logical steps and computed using a marginally different formula.
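For the record, here is the classic odds-only Eratosthenes written in the same shape, with k starting at (n * n) / 2; apart from the names it is Sundaram31_e line for line. A throwaway std::vector&lt;bool&gt; stand-in for the hypothetical bitmap_t is included so the sketch is self-contained:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Stand-in for the hypothetical bitmap_t; bts() mimics BTS.
struct bitmap_t
{
   std::vector<bool> bits;

   explicit bitmap_t (uint64_t n) : bits(n) {}

   void set_all (bool v)       { bits.assign(bits.size(), v); }
   bool bt  (uint64_t k) const { return bits[k]; }
   bool bts (uint64_t k)       { bool prior = bits[k]; bits[k] = true; return prior; }
};

uint64_t Eratosthenes31 (bitmap_t &bm, uint32_t max_bit)
{
   uint32_t max_factor_bit = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
   uint64_t overdraw = 0;

   bm.set_all(0);

   for (uint32_t i = 1; i <= max_factor_bit; ++i)
   {
      if (bm.bt(i))  continue;     // 2*i+1 is composite, skip it

      uint32_t n = 2 * i + 1;
      uint32_t k = (n * n) / 2;    // == n * i + i, since (2i+1)^2 / 2 truncates to 2i^2 + 2i

      for ( ; k <= max_bit; k += n)
      {
         overdraw += bm.bts(k);
      }
   }

   return overdraw;
}
```

Given identical inputs this returns the same overdraw and leaves behind the same bitmap as Sundaram31_e, as the substitution argument above predicts.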

    P.S.: after seeing overdraw in the code so many times, people will probably want to know what it actually is... Sieving the numbers up to 2^32-1 (i.e. a full 2^31 bit bitmap) Sundaram's has an overdraw of 8,643,678,027 (roughly 2 * 2^32) or 444.6%; with the small fix that turns it into odds-only Eratosthenes the overdraw becomes 2,610,022,328 or 134.2%.