Tags: rust, benchmarking, cpu-cache, false-sharing

Can't reproduce false cache line sharing problem in Rust


I'm trying to reproduce example 6 of the Gallery of Processor Cache Effects.

The article gives this function (in C#) as an example of how to test for false sharing:

private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}

If we create four threads and pass this function the positions 0, 1, 2, and 3, the computation takes a long time to finish (the author got 4.3 seconds). If we instead pass, for instance, 16, 32, 48, and 64, we get much better results (0.28 seconds).

I've come up with the following function in Rust:

use std::sync::Arc;
use std::thread;

pub fn cache_line_sharing(arr: [i32; 128], pos: usize) -> (i32, i32) {
    let arr = Arc::new(arr);
    let handles: Vec<_> = (0..4).map(|thread_number| {
        let arr = arr.clone();
        let pos = thread_number * pos;
        thread::spawn(move || unsafe {
            let p = (arr.as_ptr() as *mut i32).offset(pos as isize);
            for _ in 0..1_000_000 {
                *p = *p.wrapping_add(3);
            }
        })
    }).collect();

    for handle in handles {
        handle.join().unwrap();
    }

    (arr[0], arr[1])
}

Benchmarking it with two sets of arguments (0, 1, 2, 3 and 0, 16, 32, 48) gives me almost identical results: 108.34 and 105.07 microseconds.

I use the criterion crate for the benchmarks. I have a 2015 MacBook Pro with an Intel i5-5257U CPU (2.70 GHz). My system reports a 64-byte cache line size.

If anybody wants to see my full benchmarking code, here are the links:

  • lib.rs
  • cache_lines.rs

I want to understand the problem and find a way to reproduce results similar to those in the article.


Solution

  • Your first problem is that *p.wrapping_add(3) does arithmetic on the pointer, not on the integer it points to. The first iteration of the loop loads the value three elements after p and stores it at p, and Rust optimizes away the other 999,999 iterations of the loop as redundant. You meant (*p).wrapping_add(3).
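
    For illustration (my sketch, not part of the original answer): a method call binds more tightly than the unary `*`, so `*p.wrapping_add(3)` parses as `*(p.wrapping_add(3))` — pointer arithmetic first, dereference second. The difference is easy to check directly:

```rust
// Precedence demo: *p.wrapping_add(3) vs (*p).wrapping_add(3)
pub fn precedence_demo() -> (i32, i32) {
    let arr = [10i32, 20, 30, 40, 50];
    let p = arr.as_ptr();
    unsafe {
        // *p.wrapping_add(3) parses as *(p.wrapping_add(3)):
        // advance the pointer by three elements, then dereference
        let via_pointer = *p.wrapping_add(3); // arr[3] == 40
        // the intended integer arithmetic: dereference first, then add
        let via_value = (*p).wrapping_add(3); // arr[0] + 3 == 13
        (via_pointer, via_value)
    }
}
```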

    After that change, Rust optimizes the 1000000 additions by 3 into one addition by 3000000. You can use read_volatile and write_volatile to avoid that optimization.
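
    The volatile pair can be exercised in isolation (again my sketch, not the answer's code): `read_volatile`/`write_volatile` force one real load and one real store per iteration, so the compiler cannot collapse the loop into a single addition, while the final value is unchanged:

```rust
// Volatile read-modify-write: each iteration performs an actual load and store
pub fn volatile_counter(iters: u32) -> i32 {
    let mut x: i32 = 0;
    let p: *mut i32 = &mut x;
    unsafe {
        for _ in 0..iters {
            p.write_volatile(p.read_volatile().wrapping_add(3));
        }
    }
    x
}
```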

    Although these two changes are sufficient to demonstrate the effect you’re looking for in my test, note that using unsafe operations to mutate an immutably borrowed array is undefined behavior. Rust is allowed to optimize under the assumption that unsafe code upholds certain invariants, which this code does not, so Rust would be entirely within its rights to replace your code with anything it feels like.

    You presumably used immutable borrows to get around the restriction on copying mutable references and mutable pointers between threads. Here’s a less undefined way to get around that restriction, I think (although honestly, I won’t be too surprised if someone replies to point out some way in which this is still wrong).

    use std::cell::UnsafeCell;
    use std::sync::Arc;
    use std::thread;

    pub fn cache_line_sharing(arr: [i32; 128], pos: usize) -> (i32, i32) {
        struct SyncWrapper(UnsafeCell<[i32; 128]>);
        unsafe impl Sync for SyncWrapper {}
    
        assert_ne!(pos, 0);
        let arr = Arc::new(SyncWrapper(UnsafeCell::new(arr)));
        let handles: Vec<_> = (0..4)
            .map(|thread_number| {
                let arr = arr.clone();
                let pos = thread_number * pos;
                thread::spawn(move || unsafe {
                    let p: *mut i32 = &mut (*arr.0.get())[pos];
                    for _ in 0..1_000_000 {
                        p.write_volatile(p.read_volatile().wrapping_add(3));
                    }
                })
            })
            .collect();
    
        for handle in handles {
            handle.join().unwrap();
        }
    
        let arr = unsafe { *arr.0.get() };
        (arr[0], arr[1])
    }
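
    As a further aside (my sketch, not part of the answer above): the same experiment can be written without any unsafe code by giving each thread an AtomicI32 slot. fetch_add with Relaxed ordering is a real read-modify-write on the cache line, so the compiler cannot fold the loop away, and false sharing is still observable when the slots sit close together (stride 1) versus a cache line apart (stride 16):

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::Arc;
use std::thread;

// Four threads each increment their own AtomicI32 counter.
// `stride` controls how far apart the counters are: adjacent
// counters (stride 1) share a cache line; stride 16 (64 bytes
// of i32) puts each counter on its own line.
pub fn cache_line_sharing_atomic(stride: usize, iters: u32) -> i32 {
    assert!(stride >= 1 && 3 * stride < 128, "slots must fit in the array");
    let arr: Arc<Vec<AtomicI32>> =
        Arc::new((0..128).map(|_| AtomicI32::new(0)).collect());
    let handles: Vec<_> = (0..4)
        .map(|t| {
            let arr = Arc::clone(&arr);
            thread::spawn(move || {
                let slot = &arr[t * stride];
                for _ in 0..iters {
                    // a genuine atomic RMW each iteration; the loop
                    // cannot be collapsed into one addition
                    slot.fetch_add(3, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    arr[0].load(Ordering::Relaxed)
}
```

    Benchmarking this function with stride 1 against stride 16 should show the same gap the article describes, with no undefined behavior involved.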