How is Rust --release build slower than Go?

I'm trying to learn about Rust's concurrency and parallel computing and threw together a small script that iterates over a vector of vectors like it was an image's pixels. Since at first I was trying to see how much faster it gets iter vs par_iter I threw in a basic timer -- which is probably not amazingly accurate. However, I was getting crazy high numbers. So, I thought I would put together a similar piece of code on Go that allows for easy concurrency and the performance is ~585% faster!

Rust was tested with --release

I also tried using native thread pool but the results were the same. Looked at how many threads I was using and for a bit I was messing around with that as well, to no avail.

What am I doing wrong? (don't mind the definitely not performant way of creating a random value filled vector of vectors)

Rust code (~140ms)

use rand::Rng;
use std::time::Instant;
use rayon::prelude::*;

fn normalise(value: u16, min: u16, max: u16) -> f32 {
    (value - min) as f32 / (max - min) as f32
}

fn main() {
    let pixel_size = 9_000_000;
    let fake_image: Vec<Vec<u16>> = (0..pixel_size).map(|_| {
        (0..4).map(|_| {
            rand::thread_rng().gen_range(0..=u16::MAX)
        }).collect()
    }).collect();

    // Time starts now.
    let now = Instant::now();

    let chunk_size = 300_000;

    let _normalised_image: Vec<Vec<Vec<f32>>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<Vec<f32>> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            vec![r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed: {:.2?}", elapsed);
}

Go code (~24ms)

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

func normalise(value uint16, min uint16, max uint16) float32 {
    return float32(value-min) / float32(max-min)
}

func main() {
    const pixelSize = 9000000
    var fakeImage [][]uint16

    // Create a new random number generator
    src := rand.NewSource(time.Now().UnixNano())
    rng := rand.New(src)

    for i := 0; i < pixelSize; i++ {
        var pixel []uint16
        for j := 0; j < 4; j++ {
            pixel = append(pixel, uint16(rng.Intn(1<<16)))
        }
        fakeImage = append(fakeImage, pixel)
    }

    normalised_image := make([][4]float32, pixelSize)
    var wg sync.WaitGroup

    // Time starts now
    now := time.Now()
    chunkSize := 300_000
    numChunks := pixelSize / chunkSize
    if pixelSize%chunkSize != 0 {
        numChunks++
    }

    for i := 0; i < numChunks; i++ {
        wg.Add(1)

        go func(i int) {
            // Loop through the pixels in the chunk
            for j := i * chunkSize; j < (i+1)*chunkSize && j < pixelSize; j++ {
                // Normalise the pixel values
                _r := normalise(fakeImage[j][0], 0, ^uint16(0))
                _g := normalise(fakeImage[j][1], 0, ^uint16(0))
                _b := normalise(fakeImage[j][2], 0, ^uint16(0))
                _a := normalise(fakeImage[j][3], 0, ^uint16(0))

                // Set the pixel values
                normalised_image[j][0] = _r
                normalised_image[j][1] = _g
                normalised_image[j][2] = _b
                normalised_image[j][3] = _a
            }

            wg.Done()
        }(i)
    }

    wg.Wait()

    elapsed := time.Since(now)
    fmt.Println("Time taken:", elapsed)
}

Solution

The most important initial change for speeding up your Rust code is using the correct type. In Go, you use a [4]float32 to represent an RBGA quadruple, while in Rust you use a Vec<f32>. The correct type to use for performance is [f32; 4], which is an array known to contain exactly 4 floats. An array with known size need not be heap-allocated, while a Vec is always heap-allocated. This improves your performance drastically - on my machine, it's a factor of 8 difference.

Original snippet:

    let fake_image: Vec<Vec<u16>> = (0..pixel_size).map(|_| {
        (0..4).map(|_| {
            rand::thread_rng().gen_range(0..=u16::MAX)
        }).collect()
    }).collect();

... 

    let _normalised_image: Vec<Vec<Vec<f32>>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<Vec<f32>> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            vec![r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();

New snippet:

    let fake_image: Vec<[u16; 4]> = (0..pixel_size).map(|_| {
    let mut result: [u16; 4] = Default::default();
    result.fill_with(|| rand::thread_rng().gen_range(0..=u16::MAX));
    result
    }).collect();

...

    let _normalised_image: Vec<Vec<[f32; 4]>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<[f32; 4]> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            [r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();

On my machine, this results in a roughly 7.7x speedup, bringing Rust and Go roughly to parity. The overhead of doing a heap allocation for every single quadruple slowed Rust down drastically and drowned out everything else; eliminating this puts Rust and Go on more even footing.

Second, there is a slight error in your Go code. In your Rust code, you calculate a normalized r, g, b, and a, while in your Go code, you only calculate _r, _g, and _b. I don't have Go installed on my machine, but I imagine this gives Go a slight unfair advantage over Rust, since you're doing less work.

Third, you are still not quite doing the same thing in Rust and Go. In Rust, you split the original image into chunks and, for each chunk, produce a Vec<[f32; 4]>. This means you still have a bunch of chunks sitting around in memory that you'll later have to combine into a single final image. In Go, you split the original chunks and, for each chunk, write the chunk into a common array. We can rewrite your Rust code further to perfectly mimic the Go code. Here is what this looks like in Rust:

let _normalized_image: Vec<[f32; 4]> = {
    let mut destination = vec![[0 as f32; 4]; pixel_size];
    
    fake_image
        .par_chunks(chunk_size)
        // The "zip" function allows us to iterate over a chunk of the input 
        // array together with a chunk of the destination array.
        .zip(destination.par_chunks_mut(chunk_size))
        .for_each(|(i_chunk, d_chunk)| {
        // Sanity check: the chunks should be of equal length.
        assert!(i_chunk.len() == d_chunk.len());
        for (i, d) in i_chunk.iter().zip(d_chunk) {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            *d = [r, g, b, a];

            // Alternately, we could do the following loop:
            // for j in 0..4 {
            //  d[j] = normalise(i[j], 0, u16::MAX);
            // }
        }
    });
    destination
};

Now your Rust code and your Go code are truly doing the same thing. I suspect you'll find the Rust code is slightly faster.

Finally, if you were doing this in real life, the first thing you should try would be using map as follows:

    let _normalized_image = fake_image.par_iter().map(|&[r, b, g, a]| {
    [ normalise(r, 0, u16::MAX),
      normalise(b, 0, u16::MAX),
      normalise(g, 0, u16::MAX),
      normalise(a, 0, u16::MAX),
      ]
    }).collect::<Vec<_>>();

This is just as fast as manually chunking on my machine.