Tags: c, optimization, rust, modulo

Optimising the 64-bit modulus operation for a repeatedly used runtime modulus


I have billions of unsigned 64-bit numbers (x) for which I need to find x % m. m is the same for all, but is not a compile-time constant. m is greater than one.

Both x and m are equally likely to hold any 64-bit value.

(If m were a compile-time constant, I could just trust the compiler to do its best.)

As m is the same for all of these operations, and I have storage available, is there an optimisation which will let me calculate all the results for x % m faster than if m were different for each x?

(I'm inspired by the division-by-a-constant optimisation which transforms division into multiplication, shifts and addition. Illustrative code in C, Rust or pseudo-code would be great.)

I have looked at the document, but its optimisation for calculating the modulus without calculating the quotient only applies to divisors of the form 2^n ± 1.
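
For reference, the multiply-and-shift trick works just as well when the divisor is only known at runtime: precompute a fixed-point reciprocal of m once, then replace every division by a multiplication, a shift and at most one corrective subtraction. The sketch below is only an illustration of that idea (untested, and the PrecomputedMod name is made up for this example); it relies on m > 1 as stated above:

    struct PrecomputedMod {
        m: u64,
        inv: u64, // floor(2^64 / m); fits in a u64 because m > 1
    }

    impl PrecomputedMod {
        fn new(m: u64) -> Self {
            assert!(m > 1);
            let inv = ((1u128 << 64) / m as u128) as u64;
            PrecomputedMod { m, inv }
        }

        fn reduce(&self, x: u64) -> u64 {
            // q is either floor(x / m) or floor(x / m) - 1, so at most one
            // conditional subtraction corrects the remainder.
            let q = ((x as u128 * self.inv as u128) >> 64) as u64;
            let r = x - q * self.m;
            if r >= self.m { r - self.m } else { r }
        }
    }

Whether this actually beats the hardware divide for full 64-bit operands depends on the CPU; it is shown only to make the referenced trick concrete.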


Solution

  • Below is an approach that I will test, as the comments on the question made it plain that the distributions of x and m are important.

    What I hadn't realised is that it is 50% likely that m > x.

    Untested pseudo-code:

    fn fast_mod(x: u64, m: u64) -> u64 {
        if m > x {
            x
        } else if m > x / 2 {
            // same test as 2*m > x, written so that it cannot overflow
            x - m
        } else {
            x % m
        }
    }
    

    This may well be what the compiler already produces; I had not tested it.


    Update: I've tested this optimisation on my i5 2500K, processing 1 GiB of data.

    The result is a stable 67% speed increase:

    opt: 713.113926ms
    plain: 1.195687298s
    
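    That ratio is plausible for uniformly distributed inputs: roughly half of the x values satisfy m > x (just a compare), roughly another quarter fall in m <= x < 2*m (a compare plus a subtraction), so only about a quarter of the reductions ever reach the hardware divide.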

    The benchmarking code is:

    use std::time::Instant;
    use rand::Rng;
    
    const ARR_LEN: usize = 128 * 1024;
    const ROUNDS: usize = 1024;
    
    fn plain_mod_test(x: &[u64; ARR_LEN], m: u64, result: &mut [u64; ARR_LEN]) {
        for i in 0..ARR_LEN {
            result[i] = x[i] % m;
        }
    }
    
    fn opt_mod_test(x: &[u64; ARR_LEN], m: u64, result: &mut [u64; ARR_LEN]) {
        for i in 0..ARR_LEN {
            result[i] = if m > x[i] {
                x[i]
            } else if m > x[i] / 2 {
                x[i] - m
            } else {
                x[i] % m
            }
        }
    }
    
    fn main() {
        // 1 MiB of pseudo-random values x
        let mut rng = rand::thread_rng();
        let mut x = [0u64; ARR_LEN];
        for i in 0..ARR_LEN {
            x[i] = rng.gen();
        }
    
        // 1024 pseudo-random moduli m (8 KiB)
        let mut m = [0u64; ROUNDS];
        for r in 0..ROUNDS {
            m[r] = rng.gen(); // there's only a 1 in 2^64 chance that 0 will be generated
        }
    
        // 1 GiB of output each, use Vec to avoid stack overflow
        let mut plain_results = vec![[0u64; ARR_LEN]; ROUNDS];
        let mut opt_results = vec![[0u64; ARR_LEN]; ROUNDS];
    
        // Each of the two timed loops below takes 1 GiB worth of values modulo m.
        let start_opt = Instant::now();
        for r in 0..ROUNDS {
            opt_mod_test(&x, m[r], &mut opt_results[r]);
        }
        let stop_opt = Instant::now();
        println!("opt: {:?}", stop_opt - start_opt);
    
        let start_plain = Instant::now();
        for r in 0..ROUNDS {
            plain_mod_test(&x, m[r], &mut plain_results[r]);
        }
        let stop_plain = Instant::now();
        println!("plain: {:?}", stop_plain - start_plain);
    
    
        // Stop the results from being optimised away, by using them.
        let mut plain_sum = 0u64;
        let mut opt_sum = 0u64;
        for r in 0..ROUNDS {
            for i in 0..ARR_LEN {
                // wrapping_add: the sums will exceed u64::MAX, and wrapping
                // avoids a panic in debug builds; only the usage matters here.
                plain_sum = plain_sum.wrapping_add(plain_results[r][i]);
                opt_sum = opt_sum.wrapping_add(opt_results[r][i]);
            }
        }
    
        println!("opt_sum: {:?}", opt_sum);
        println!("plain_sum: {:?}", plain_sum);
    
    }
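
    To reproduce this, an ordinary Cargo project with the rand crate and a release build are assumed (a debug build keeps overflow checks and is not representative of the timings above); the version below is a guess, any recent rand with thread_rng and gen should do:

    [dependencies]
    rand = "0.8"

    Run it with cargo run --release. The summing loop at the end exists only so the optimiser cannot discard the results; on recent Rust, std::hint::black_box would be an alternative way to keep them observable.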