performance sorting rust benchmarking selection-sort

Why is Rust's default sort function slightly slower than my Selection Sort for small arrays?


I'm quite new to Rust, so I might be missing something simple. I am using Rust 1.70.0-nightly. Here is the necessary code:

fn selection_sort(original_vec: &mut Vec<i32>) -> Vec<i32> {
    for i in 0..original_vec.len()-1 {
        let mut smallest: usize = i;
        for j in i+1..original_vec.len() {
            if original_vec[j] < original_vec[smallest] {
                smallest = j;
            }
        }
        if smallest != i {
            original_vec.swap(i, smallest);
        }
    };
    original_vec.to_vec()
}

// helper function for testing (uses builtin sort function)
fn rust_sort<A, T>(mut array: A) -> A
where
    A: AsMut<[T]>,
    T: Ord,
{
    let slice = array.as_mut();
    slice.sort();

    array
}

const TEST_VECS: [[i32; 10]; 6] = [
    [1, 3, 2, 9, 6, 7, 4, 10, 8, 5],
    [1, 2, 7, 2, 9, 9, 7, 10, 2, 1],
    [0, 4, 1, 3, 9, 12, 3, 0, 13, 8],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [-1, -5, -10, 1, 10, 2, -123, -34, 0, -32],
    [i32::MAX, -32984, i32::MIN, 31648, 5349857, -30954, -2343285, 0, 1, i32::MIN],
];

#[bench]
fn bench_rust_sort(b: &mut Bencher) {
    b.iter(|| {
        for i in TEST_VECS {
            (&mut i.to_vec()).sort()
        }
    })
}

#[bench]
fn bench_selection_sort(b: &mut Bencher) {
    b.iter(|| {
        for i in TEST_VECS {
            selection_sort(&mut i.to_vec());
        }
    })
}

When I run cargo bench:

$ cargo bench
  Compiling rust-algs v0.1.0 (/home/josia/projects/rust-algs)
    Finished bench [optimized] target(s) in 0.25s
     Running unittests src/lib.rs (target/release/deps/rustalgs-ae260c07593c3aad)

running 3 tests
test test_selection_sort ... ignored
test bench_rust_sort      ... bench:         106 ns/iter (+/- 8)
test bench_selection_sort ... bench:         102 ns/iter (+/- 9)

test result: ok. 0 passed; 0 failed; 1 ignored; 2 measured; 0 filtered out; finished in 7.65s

I tried this many times, and even renamed the benchmark functions to change the order of testing. No matter what, my custom selection sort still performs faster. I'm guessing the problem lies in the fact that I had to wrap the default sort function in a helper. Calling it directly wouldn't work: even if I cloned the TEST_VECS constant into a vector in the benchmark function, the sort would just keep re-sorting the same (already sorted) vector across iterations. And if I cloned the constant inside the bench closure, the cloning would dominate the measurement and I wouldn't be benchmarking just the code I'm trying to run.

Is there a problem with the way I am benchmarking these functions, or is my custom function somehow just faster?


Solution

  • TL;DR:

    • Rustc -C opt-level=3 fully inlines and fully unrolls your selection_sort for this compile-time constant size=10. With opt-level=2 it inlines but compiles to a loop like the source.

    • Rust's .sort uses Insertion Sort for sizes this small, but for some reason it doesn't inline. So that's actually running a not-unrolled loop with a size that's a runtime-variable (as far as it knows). Note the caller doing mov esi, 10 / ... / call core::slice::sort::insertion_sort_shift_left. I think this is the key thing that's making .sort slightly slower.

    • Copying from TEST_VECS is fine, although maybe a bit small; modern CPUs' branch predictors might learn the pattern of branching and do better than you'd expect as part of a larger program, for unsorted inputs of that size. You could have more than 6 slices and still not be getting any L1d cache misses.

      It would be better to avoid alloc/dealloc inside the repeat loop. (Use .clone_from_slice (as described here) into a slice or vector that gets allocated outside the repeat loop). Both loops are paying for the same alloc/dealloc overhead, and that should balance out. The compiler already does the actual copying efficiently here.

      Rust's .sort doesn't internally alloc anything for slice sizes this small (using Insertion Sort), and passing/returning a Vec optimizes away to just in-place sorting, no extra copying.

    • black_box could make this more robust against potential optimizations that defeat the repeat-loop, but it looks like that isn't happening with current versions.

    • Selection Sort has O(N^2) time complexity (best and worst case). Insertion Sort is O(N) for already sorted arrays, which is one reason it's the standard choice for small problems (and sub-problems of divide-and-conquer algorithms for larger arrays), but in general is O(N^2) average and worst case. For large problems, all O(N^2) sorts are garbage, so .sort and .sort_unstable use O(N log N) algorithms that are much faster than selection_sort. Murasame's answer measured a factor of 15 speed difference for size=1000. (https://doc.rust-lang.org/std/vec/struct.Vec.html#method.sort / https://doc.rust-lang.org/std/primitive.slice.html#method.sort_unstable)

    • If your CPU is Skylake-family, you should look for a Rust equivalent of How can I mitigate the impact of the Intel jcc erratum on gcc? - without controlling for that factor, one microbenchmark might get bitten by it while another doesn't.

    So we can say that Selection Sort compiles to faster asm with current Rustc with -C opt-level=3, for small compile-time-constant Vecs that aren't (close to) already sorted. Based on what we're seeing in the asm as the apparent reason for these results, I expect it would be different if any of those factors were different. Perhaps even different tuning defaults for different ISAs could make it not fully unroll. Also, it's somewhat important that mainstream x86-64 CPUs can handle such large loops fully efficiently (via a large-enough uop cache).


    selection_sort inlined and fully unrolled, .sort didn't

    Your measurements are probably correct for this specific case, of size=10 as compile-time constant that the compiler can see. rustc -C opt-level=3 is fully inlining and fully unrolling your selection_sort, but emitting a call to the standard library's Insertion Sort. So .sort() includes the loop overhead of dealing with a runtime-variable slice size.

    (I don't know how to get Matt Godbolt's Compiler Explorer to show me asm for the actual #[bench] functions, so I made my own simple counted-loop caller on the assumption that inlining choices would be basically the same into b.iter(|| { ... }) loops. Godbolt - change which call is uncommented to see asm for the other one.)

    Normally Insertion Sort is just as good as Selection Sort for unsorted data like yours (both N^2 with similar constant factor), but it does have more data movement which maybe doesn't lend itself as well to fully unrolling for a compile-time-constant .len(). Or some difference in the way it comes from the standard library vs. your selection_sort being simple and fully there in the source file, and written for i32 specifically? But .sort does inline and optimize away the logic to choose insertion_sort_shift_left without any of the large-slice code still being present, so nothing is inherently stopping the compiler, it just chose not to.
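    For reference, the textbook form of Insertion Sort looks like the sketch below. This is not the stdlib's insertion_sort_shift_left (which shifts elements rather than swapping), just the basic algorithm, but it shows the extra data movement compared to Selection Sort's at-most-one-swap per outer iteration:

```rust
// Textbook insertion sort, for comparison with the question's selection_sort.
// Each out-of-place element gets moved left one swap at a time, so there's
// more data movement than selection sort's single swap per outer iteration.
fn insertion_sort(v: &mut [i32]) {
    for i in 1..v.len() {
        let mut j = i;
        // bubble v[i] leftward until it's in sorted position
        while j > 0 && v[j - 1] > v[j] {
            v.swap(j - 1, j);
            j -= 1;
        }
    }
}
```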

    Anyway, I think fully unrolling the outer and inner loops is what's making selection_sort fast here. After copying a TEST_VEC into a newly-allocated Vec, the asm goes right into comparing, like

    # rustc nightly -C opt-level=3  for x86-64
    # inside the main repeat loop in  bench_selection_sort
    ...
            mov     edi, 40
            mov     esi, 4
            call    rbp              # call alloc
            test    rax, rax
            je      .LBB0_4          # check for null
    # pointer to a slice of TEST_VECS in R15
    # pointer to the newly-allocated Vec<i32> in RAX
            movups  xmm0, xmmword ptr [r15]
            movups  xmm1, xmmword ptr [r15 + 16]
            movups  xmmword ptr [rax], xmm0          # with -C target-cpu=x86-64-v3
        movups  xmmword ptr [rax + 16], xmm1     # just one YMM 32-byte copy is needed, vs. 2x 16-byte with baseline SSE2
            mov     rdx, qword ptr [r15 + 32]
            mov     qword ptr [rax + 32], rdx      # copy last 8 of the 40 bytes
    # and now the sorting
            mov     ecx, dword ptr [rax]
            mov     edi, dword ptr [rax + 8]
            xor     esi, esi
            cmp     dword ptr [rax + 4], ecx
            setl    sil                  # 0 or 1 according to vec[1] < vec[0]
            mov     r9d, 2
            cmp     edi, dword ptr [rax + 4*rsi]   # compare at a data-dependent address; surprising for Selection Sort
            jl      .LBB0_9
            mov     r9, rsi            # strange that this isn't a CMOV; the only branch to .LBB0_9 is the preceding line
    .LBB0_9:
            mov     esi, dword ptr [rax + 28]   # load several elements
            mov     edi, dword ptr [rax + 24]   # that it's going to cmp/branch on later
            mov     r10d, dword ptr [rax + 16]
            mov     r8d, dword ptr [rax + 20]
            mov     ebx, dword ptr [rax + 12]
            mov     r11d, 3
            cmp     ebx, dword ptr [rax + 4*r9]
            jge     .LBB0_10
      ... over 400 more instructions before the bottom of the loop, mostly mov / cmp/jcc
    

    Test data

    Sorting the same data repeatedly is more or less fine, although "only" 6x 10-element sorts might be small enough for your CPU to learn the pattern of branching. (Modern IT-TAGE predictors index a prediction using previous branch history, making it possible to predict patterns for the same branch, so this doesn't necessarily give the fully-unrolled selection_sort a big advantage, especially since you're sorting 6 different slices so even full unrolling doesn't make a branch go the same way every time.) Your testing with perf stat finding only about a 1.5% mispredict rate seems lowish, but that does include loop branches. Try making the arrays a bit longer and/or adding a few more.

    A few years ago, I was playing with a hand-written asm Bubble Sort from a code-golf exercise, and my Skylake was able to learn its pattern of branching for input sizes up to about 14 elements IIRC, near-zero mispredicts until I made the size larger. And that's Bubble Sort, one of the worst O(N^2) sorts, plus its loop branching. I was benchmarking it with a caller that just did a couple vmovdqa ymm stores from vector regs, which is the cheapest possible way to re-generate a small input for an in-place sort. Rust is loading that data each repeat-loop iteration since there are 6 different sources, but that's only negligibly more expensive. Scalar loads inside the sort function can load right away without even waiting for the vector stores to commit to L1d cache; modern x86 CPUs have good store-to-load forwarding for scalar reloads of vector stores. Everything hits in L1d cache.

    Another option would be a vectorized xorshift+ like in What's the fastest way to generate a 1 GB text file containing random digits? to get fresh data to sort every iteration. Using the sorted array as the seed for an LFSR (like Murasame's answer) would be a way to benchmark latency instead of throughput, but will also have a store-forwarding stall when you do a SIMD vector load of the array that was just written by scalar stores. (Assuming the compiler auto-vectorizes your loop.)
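    For illustration, a scalar xorshift32 (Marsaglia's generator; the linked answer describes a SIMD-vectorized xorshift+, which would be faster) is enough to regenerate fresh unsorted data each repeat-loop iteration:

```rust
// Minimal scalar xorshift32 PRNG to produce fresh data to sort every
// benchmark iteration, so the branch pattern isn't identical each time.
// A SIMD-vectorized xorshift+ would have higher throughput.
struct XorShift32 {
    state: u32, // must be non-zero
}

impl XorShift32 {
    fn next(&mut self) -> u32 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.state = x;
        x
    }
}

// Fill a slice with pseudo-random values before handing it to a sort.
fn fill_random(v: &mut [i32], rng: &mut XorShift32) {
    for elem in v.iter_mut() {
        *elem = rng.next() as i32;
    }
}
```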


    O(N^2) sorts like Selection Sort are slow on bigger arrays

    I assume you already know this, but Selection Sort has O(N^2) time complexity so it's garbage for larger sizes. Rust's .sort and .sort_unstable use O(N log N) sorts for larger problems (iterative merge "inspired by timsort", or pattern-defeating quicksort respectively), but use Insertion Sort for small sizes, or sub-problems of the larger sorts. Here the size is a compile-time constant and Rust's .sort function can inline enough to optimize away branching on the size to pick a strategy, so it's just your selection_sort vs. the standard library's Insertion Sort (core::slice::sort::insertion_sort_shift_left) for problems of this size.

    (Surprisingly, .sort_unstable inlines less (Godbolt), calling core::slice::sort::recurse which checks the size at run-time for being less than 21, otherwise continuing into its quicksort.)

    Insertion Sort and Selection Sort do well for small arrays where more complex algorithms would spend too much time sub-dividing an already-small problem. That's why real-world sorts use Insertion Sort once they've sub-divided and partially-sorted enough to create small sub-problems. (Or a sorting network, perhaps using SIMD vectors.)

    Stable sorting is irrelevant when the objects being sorted are simple integers, not a struct with a key and some other data. Equal i32 objects are indistinguishable. But it seems .sort doesn't try to detect that since we get different asm from .sort and .sort_unstable.
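    A trivial check of that point: for plain integers the two sorts produce identical output, so the only difference is which code path gets compiled in (hypothetical helper for demonstration):

```rust
// For primitive i32, stable and unstable sort give the same result,
// since equal values are indistinguishable; only the generated code differs.
fn sorted_both_ways(input: &[i32]) -> (Vec<i32>, Vec<i32>) {
    let mut a = input.to_vec();
    let mut b = input.to_vec();
    a.sort();           // stable: dispatches to insertion sort at small sizes
    b.sort_unstable();  // pdqsort-based path
    (a, b)
}
```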


    black_box

    You don't do anything to stop the sort from being optimized away; the resulting vector is unused. But it appears current versions of rustc -C opt-level=3 miss that optimization so your benchmark appears valid.

    Using black_box(sort_result) would be safer / future-proof, although if the compiler did fully see through your benchmark it could just copy from already-sorted arrays to satisfy the black_box requirement to materialize a sort result somewhere. So to be fully safe, you'd use black_box on the input slice or vector and then on the output; hopefully black_box makes the compiler forget what values are there, i.e. treat it like a function call that might have modified the slice, not just read it.
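    A sketch of that defensive pattern, using core::hint::black_box (stable since Rust 1.66); the function name here is hypothetical, not part of the question's benchmark:

```rust
use core::hint::black_box;

// Defensive benchmark body: black_box the input so the compiler can't
// assume it knows the values, and black_box the output so the sort
// can't be optimized away as unused.
fn bench_body(data: &[i32; 10]) -> Vec<i32> {
    let mut v = data.to_vec();
    black_box(v.as_mut_slice()); // compiler must assume the contents may have changed
    v.sort();
    black_box(v.as_slice());     // compiler must assume the result is observed
    v
}
```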

    And avoiding alloc/dealloc

    The copying is super cheap, just 2 vector (movups) and 1 scalar (mov rdx, [r15 + 32] / mov [rax+32], rdx) load+store.

    But the alloc/dealloc could be avoided with a better benchmark. I was wrong about this in comments; I was looking for a call some_longer_name but actually Rustc likes to load library function pointers into registers (if they're going to be called inside loops) so you have to watch for the call mnemonic to spot instructions like call rbp.

    How can I copy a vector to another location and reuse the existing allocated memory? describes how to use .clone_from or .clone_from_slice() to copy the data from a slice into an already-allocated Vec. I looked at the asm to confirm that there were no call instructions inside the for i in TEST_VECS { loop, unlike the original.

    use core::hint::black_box;
    
    // I don't know how to get it to show asm for the #[bench] version
    pub
    fn xbench_selection_sort(a: i32) {
        // allocate correct amount of storage once, and redundantly copy into it because I don't know Rust very well
        let mut test_vec = TEST_VECS[0].to_vec();
        for _b in 0..a {  // just a counted repeat loop
            for i in TEST_VECS {
                //let mut test_vec = i.to_vec();  // would alloc/free inside the loop
                test_vec.clone_from_slice(&i);    // copy into already-allocated space
                black_box(&mut test_vec.as_slice());  // force the data (but not the pointers and size) to exist in memory
                selection_sort(&mut test_vec);
            black_box(&mut test_vec.as_slice());  // force the result data to exist.  Apparently costs at least 1 extra asm instruction somewhere, unfortunately.
            // actually this does store the pointers+length to the stack, not avoiding it like I hoped.
    
                //black_box(&mut test_vec[4]);   // or just ask for one element; often enough to stop a compiler from optimizing away a whole loop.
            }
        }
    }
    

    (Godbolt)

    If I understand correctly .as_slice() on a Vec is like C++ .data() on a std::vector, except it's a slice so it has a length. Anyway, the idea is to avoid asking the compiler to actually store pointer and length to stack space. But I'm not sure that worked: I am still seeing stores like mov [rsp + 24], r14 ; mov qword ptr [rsp + 32], 10 etc.

    Using Vec in the first place is unnecessary; for these small sizes, just a slice in stack space would be fine, and probably would have avoided any alloc/dealloc overhead even if we had been cloning from constant slices since the compiler would know it could reuse stack space.
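    A sketch of that idea: the question's selection_sort reworked (hypothetically) to take a slice, so the benchmark can reuse a stack array and never touch the allocator. (The saturating_sub also fixes the panic the original `len()-1` would hit on an empty input.)

```rust
// selection_sort reworked to operate on a plain slice, so callers can
// pass stack arrays; no Vec, no heap allocation.
fn selection_sort_slice(v: &mut [i32]) {
    for i in 0..v.len().saturating_sub(1) { // saturating_sub: safe for empty slices
        let mut smallest = i;
        for j in i + 1..v.len() {
            if v[j] < v[smallest] {
                smallest = j;
            }
        }
        if smallest != i {
            v.swap(i, smallest);
        }
    }
}

// One benchmark repeat-iteration: copy each test case into reused
// stack space and sort it in place; no alloc/dealloc anywhere.
fn bench_iteration(test_vecs: &[[i32; 10]]) {
    let mut buf = [0i32; 10];     // stack space, reused every iteration
    for src in test_vecs {
        buf.copy_from_slice(src); // cheap copy, like the movups pair in the asm
        selection_sort_slice(&mut buf);
    }
}
```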

    This function does a 240-byte memcpy of TEST_VECS to stack space every outer iteration. (It copies from there to the Vec object.) I don't know why that's happening, but I barely know Rust. It happens without any use of black_box so it's not being triggered by that.