Trying to find the optimal way to combine slices into a database key.
I need a slice that contains three things (concatenated): a one-byte prefix (DB_PREFIX_SLOT), a 20-byte address, and a 32-byte slot.
I've tried benchmarking the following three functions. I would have expected "slot_key_array" to be the fastest, since it's not allocating a vector, but "slot_key_sized" is consistently the fastest. Why?
#[inline]
pub fn slot_key_vec(address: &[u8], slot: &[u8]) -> Vec<u8> {
    // Starts with a one-byte allocation; the extends below have to grow it.
    let mut key: Vec<u8> = vec![DB_PREFIX_SLOT];
    key.extend_from_slice(address);
    key.extend_from_slice(slot);
    key
}
#[inline]
pub fn slot_key_sized(address: &[u8], slot: &[u8]) -> Vec<u8> {
    // A single up-front allocation of exactly 1 + 20 + 32 = 53 bytes.
    let mut key: Vec<u8> = Vec::with_capacity(1 + 20 + 32);
    key.push(DB_PREFIX_SLOT);
    key.extend_from_slice(address);
    key.extend_from_slice(slot);
    key
}
#[inline]
pub fn slot_key_array(address: &[u8], slot: &[u8]) -> [u8; 53] {
    // No heap allocation at all: the key is built directly on the stack.
    let key: [u8; 53] = [
        DB_PREFIX_SLOT,
        address[0], address[1], address[2], address[3], address[4], address[5], address[6], address[7],
        address[8], address[9], address[10], address[11], address[12], address[13], address[14], address[15],
        address[16], address[17], address[18], address[19],
        slot[0], slot[1], slot[2], slot[3], slot[4], slot[5], slot[6], slot[7],
        slot[8], slot[9], slot[10], slot[11], slot[12], slot[13], slot[14], slot[15],
        slot[16], slot[17], slot[18], slot[19], slot[20], slot[21], slot[22], slot[23],
        slot[24], slot[25], slot[26], slot[27], slot[28], slot[29], slot[30], slot[31],
    ];
    key
}
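For reference, a minimal Criterion harness along these lines is enough to run this comparison (the bench names and the fixed test inputs below are illustrative assumptions, not necessarily the exact setup behind the numbers further down):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// slot_key_vec, slot_key_sized and slot_key_array are assumed to be in scope,
// e.g. imported from the crate under test.
fn bench_slot_keys(c: &mut Criterion) {
    let address = [0x11u8; 20];
    let slot = [0x22u8; 32];

    // black_box keeps the compiler from constant-folding the inputs away.
    c.bench_function("vec", |b| {
        b.iter(|| slot_key_vec(black_box(&address[..]), black_box(&slot[..])))
    });
    c.bench_function("sized", |b| {
        b.iter(|| slot_key_sized(black_box(&address[..]), black_box(&slot[..])))
    });
    c.bench_function("array", |b| {
        b.iter(|| slot_key_array(black_box(&address[..]), black_box(&slot[..])))
    });
}

criterion_group!(benches, bench_slot_keys);
criterion_main!(benches);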
You are probably on Linux, where the default allocator is good (I benchmarked on Windows; after replacing the default allocator with mimalloc I got the same results).
This is because the array version needs to do a bounds check for every element access, which prevents vectorization and adds overhead. But it doesn't have to: if we make the arguments fixed-size arrays instead of slices, the bounds checks go away.
pub fn slot_key_array_arrays(address: &[u8; 20], slot: &[u8; 32]) -> [u8; 53] {
    // Same body as slot_key_array above; with &[u8; 20] and &[u8; 32] every
    // index is statically known to be in range, so no runtime checks are emitted.
    // ...
}
If you only have slices, you can convert them to arrays with try_into():
#[inline]
pub fn slot_key_copy_array_try_into(address: &[u8], slot: &[u8]) -> [u8; 53] {
    slot_key_array_arrays(address.try_into().unwrap(), slot.try_into().unwrap())
}
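The unwrap() calls panic if either slice has the wrong length. If that can happen in practice, a fallible wrapper (a hypothetical helper, not from the original post) keeps the same fast path and returns None instead:

#[inline]
pub fn slot_key_array_checked(address: &[u8], slot: &[u8]) -> Option<[u8; 53]> {
    // try_into() only succeeds when the slice length matches the array length exactly.
    Some(slot_key_array_arrays(address.try_into().ok()?, slot.try_into().ok()?))
}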
Another alternative is to use copy_from_slice():
#[inline]
pub fn slot_key_copy_from_slice(address: &[u8], slot: &[u8]) -> [u8; 53] {
    let mut key: [u8; 53] = [0; 53];
    key[0] = DB_PREFIX_SLOT;
    key[1..][..20].copy_from_slice(address);
    key[21..].copy_from_slice(slot);
    key
}
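As a sanity check (this test is illustrative and not part of the original post), all the variants should produce byte-for-byte identical keys:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn all_variants_build_the_same_key() {
        let address = [0xAAu8; 20];
        let slot = [0xBBu8; 32];
        let expected = slot_key_vec(&address, &slot);
        assert_eq!(expected, slot_key_sized(&address, &slot));
        assert_eq!(&expected[..], &slot_key_array(&address, &slot)[..]);
        assert_eq!(&expected[..], &slot_key_copy_array_try_into(&address, &slot)[..]);
        assert_eq!(&expected[..], &slot_key_copy_from_slice(&address, &slot)[..]);
    }
}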
Benchmark:
copy_from_slice time: [6.8321 ns 6.8841 ns 6.9611 ns]
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe
array_try_into time: [4.6906 ns 4.7001 ns 4.7119 ns]
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe
array_arrays time: [4.8212 ns 4.8325 ns 4.8473 ns]
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe
array time: [12.963 ns 12.992 ns 13.029 ns]
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe
sized time: [10.116 ns 10.218 ns 10.327 ns]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
vec time: [25.602 ns 27.162 ns 29.211 ns]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
So, your version with arrays instead of slices is the fastest, together with try_into() (same performance), then copy_from_slice(), and then the others.
Yet another alternative is to help LLVM eliminate the bounds checks. It has trouble here because you index in increasing order: checking index 0 says nothing about index 1, so that gets checked too, then 2, then 3... but if we check the last index first, LLVM knows that all earlier indices also exist:
#[inline]
pub fn slot_key_array(address: &[u8], slot: &[u8]) -> [u8; 53] {
    // Touch the last needed element of each slice first: if these two accesses
    // don't panic, every lower index is also in bounds.
    address[19];
    slot[31];
    // ... same element-by-element body as before
}
This brings this code on par with the arrays version:
array time: [4.6107 ns 4.6164 ns 4.6238 ns]
  change: [-65.325% -64.755% -64.411%] (p = 0.00 < 0.05)
  Performance has improved.
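A closely related option, sketched here as an assumption rather than one of the benchmarked variants above, is to assert the slice lengths once at the top of the function; after that, every index below the asserted length is known to be in bounds and LLVM can usually drop the per-element checks:

#[inline]
pub fn slot_key_array_asserted(address: &[u8], slot: &[u8]) -> [u8; 53] {
    // One length check per slice instead of one per element; the body below is
    // the same element-by-element initialization as slot_key_array.
    assert!(address.len() >= 20);
    assert!(slot.len() >= 32);
    [
        DB_PREFIX_SLOT,
        address[0], address[1], address[2], address[3], address[4], address[5], address[6], address[7],
        address[8], address[9], address[10], address[11], address[12], address[13], address[14], address[15],
        address[16], address[17], address[18], address[19],
        slot[0], slot[1], slot[2], slot[3], slot[4], slot[5], slot[6], slot[7],
        slot[8], slot[9], slot[10], slot[11], slot[12], slot[13], slot[14], slot[15],
        slot[16], slot[17], slot[18], slot[19], slot[20], slot[21], slot[22], slot[23],
        slot[24], slot[25], slot[26], slot[27], slot[28], slot[29], slot[30], slot[31],
    ]
}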