I am investigating slowness in a WebAssembly project, and I wonder if SIMD instructions are being emulated somehow. Here's a toy Rust library to exercise some SIMD operations:
```rust
use core::arch::wasm32::*;

#[no_mangle]
pub fn do_something(f: f32) -> f32 {
    let f4 = f32x4_splat(f);
    let mut a = f4;
    for _ in 0..100000 {
        a = f32x4_add(a, f4);
    }
    f32x4_extract_lane::<0>(a)
        + f32x4_extract_lane::<1>(a)
        + f32x4_extract_lane::<2>(a)
        + f32x4_extract_lane::<3>(a)
}
```
Then I build it with `cargo build --release --target wasm32-unknown-unknown`.
Finally I run it with:
```javascript
const response = await fetch(WASM_FILE);
const wasmBuffer = await response.arrayBuffer();
const wasmObj = await WebAssembly.instantiate(wasmBuffer, { env: {} });

function do_something() {
    wasmObj.instance.exports.do_something(0.00001);
    requestAnimationFrame(do_something);
}
requestAnimationFrame(do_something);
```
I suspect that the SIMD operations are being emulated, because I see this in the Chrome performance Call Tree:
If the SIMD operations were being lowered to single instructions, as I would expect, then nothing named `f32x4_add` should show up in the profile trace at all.
It's a well-known pitfall that if you don't enable the appropriate `target_feature` for SIMD intrinsics, they are not inlined, which causes major overhead: each `f32x4_add` becomes an out-of-line function call instead of a single instruction. It's even documented.

The solution is to turn the `simd128` target feature on. You can do that by passing `-C target-feature=+simd128` to rustc, for example via `RUSTFLAGS` or in `.cargo/config.toml`.
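In `.cargo/config.toml` form, that looks like this (the file lives next to `Cargo.toml`):

```toml
# .cargo/config.toml
# Enable WebAssembly SIMD for this target so the wasm32 intrinsics
# are inlined and lowered to single 128-bit instructions.
[target.wasm32-unknown-unknown]
rustflags = ["-C", "target-feature=+simd128"]
```

With the feature enabled crate-wide, `f32x4_add` should disappear from the call tree entirely. As an alternative that avoids a config change, annotating the individual functions with `#[target_feature(enable = "simd128")]` also allows the intrinsics to be inlined into them.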