I have been implementing various node-based binary search trees using C-ish C++ code. When benchmarking these I have noticed surprisingly large performance variations, both across compilers and in response to small code changes.
When I focused on insertion and removal in a tree that allows duplicates (as a C++ std::multiset<int> would), I found that almost all of the time is spent zig-zagging down the tree's left and right pointers in operations like "find" and "lower_bound", rather than in the conceptually "expensive" rebalancing steps that occur after inserts and deletes.
So I began to focus on one case in particular: lower bound.
// Node is a binary tree node. It has the
// usual left and right links and an
// integral key.
struct Node {
  int key;
  Node* links[2];  // links[0] is left, links[1] is right
};

// LowerBound returns the first node in
// the tree rooted at "x" whose key is
// not less than "key", or null if there
// is no such node.
Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    bool x_gte = !(x->key < key);  // x->key >= key?
    lower = x_gte ? x : lower;     // branchless candidate update
    x = x->links[!x_gte];          // descend left if x >= key, else right
  }
  return lower;
}
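To make the semantics concrete, here is a minimal check against a hand-built three-node tree (the tree shape and keys are just for illustration, assuming the Node and LowerBound definitions above):

#include <cassert>
#include <cstdio>

int main() {
  // Hand-built tree holding {10, 20, 30}:
  //       20
  //      /  \
  //    10    30
  Node n10{10, {nullptr, nullptr}};
  Node n30{30, {nullptr, nullptr}};
  Node n20{20, {&n10, &n30}};

  assert(LowerBound(&n20, 15)->key == 20);  // first key not less than 15
  assert(LowerBound(&n20, 10)->key == 10);  // exact match
  assert(LowerBound(&n20, 31) == nullptr);  // everything is less than 31
  std::printf("LowerBound checks passed\n");
  return 0;
}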
A few points and observations:

- Conditional move (cmov) instructions are faster on AMD than on Intel.
- Branchy code, rather than cmov or similar, is often much slower.
- I put an array of links (Node* links[2]) in the node instead of explicit left and right pointers. With explicit fields gcc is very quick to switch to branchy code, which is slower. With the links array gcc will index it as I have written. (A sketch of the explicit-pointer variant is included at the end of this question.)

With the code above on godbolt we can see clang generating the following:
LowerBound(Node*, int):
        xorl    %eax, %eax
        testq   %rdi, %rdi
        je      .LBB0_3
.LBB0_1:                      # =>This Inner Loop Header: Depth=1
        xorl    %ecx, %ecx
        cmpl    %esi, (%rdi)
        setl    %cl
        cmovgeq %rdi, %rax
        movq    8(%rdi,%rcx,8), %rdi
        testq   %rdi, %rdi
        jne     .LBB0_1
.LBB0_3:
        retq
while gcc is doing worse:
LowerBound(Node*, int):
        xorl    %eax, %eax
        testq   %rdi, %rdi
        je      .L5
.L4:
        cmpl    %esi, (%rdi)
        setl    %dl
        cmovge  %rdi, %rax
        movzbl  %dl, %edx
        movq    8(%rdi,%rdx,8), %rdi
        testq   %rdi, %rdi
        jne     .L4
        ret
.L5:
        ret
The gcc variant is roughly 2x slower on my machine (the geomean of the timings with tree heights 1 to 18). Can this be explained in a simple way? I notice that clang clears %ecx first, then sets %cl, then uses %rcx, whereas gcc sets %dl, then zero-extends it into %edx with movzbl, and only then uses %rdx.
gcc's approach is logically equivalent but much slower in practice. Can it be improved?
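For reference, the explicit-pointer variant I mentioned above looks roughly like this (NodeLR and LowerBoundLR are illustrative names, not the exact code I benchmarked); with this layout gcc is quick to compile the descend-left/descend-right choice into a real branch:

// Hypothetical explicit-field layout, for contrast with Node::links[2] above.
struct NodeLR {
  int key;
  NodeLR* left;
  NodeLR* right;
};

NodeLR* LowerBoundLR(NodeLR* x, int key) {
  NodeLR* lower = nullptr;
  while (x != nullptr) {
    if (!(x->key < key)) {  // x->key >= key: remember x, go left
      lower = x;
      x = x->left;
    } else {                // x->key < key: go right
      x = x->right;
    }
  }
  return lower;
}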
Using llvm-mca, a tool from the LLVM suite that statically analyzes machine code for a given target architecture, we can see that there is indeed a difference.
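As a sketch of how such numbers can be gathered (the file name is a placeholder; llvm-mca consumes assembly, not object code), pipe each compiler's assembly output straight into llvm-mca:

# Analyze gcc's and clang's codegen for a given CPU.
g++     -O2 -S -o - lower_bound.cc | llvm-mca -mcpu=skylake -timeline
clang++ -O2 -S -o - lower_bound.cc | llvm-mca -mcpu=skylake -timeline

Swapping -mcpu=skylake for -mcpu=znver3 gives the Zen3 figures.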
For the Intel Skylake architecture, GCC versus LLVM:
Instructions: 1200 vs 1200
Total Cycles: 1305 vs 1205
Total uOps:   1700 vs 1400
For the AMD Zen3 architecture, GCC versus LLVM:
Instructions: 1200 vs 1100
Total Cycles: 1205 vs 1105
Total uOps:   1200 vs 1100
The average wait times for GCC were 20% higher. For the GCC code:
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
      [0]    [1]    [2]    [3]
0.     3     0.0    0.0    12.0   xorl %eax, %eax
1.     3     11.0   0.3    0.7    testq %rdi, %rdi
2.     3     12.0   0.0    0.0    je .L5
3.     3     11.0   0.3    0.0    cmpl %esi, (%rdi)
4.     3     16.0   0.0    0.0    setl %dl
5.     3     17.0   0.0    0.0    movzbl %dl, %edx
6.     3     15.0   0.0    1.0    cmovgeq %rdi, %rax
7.     3     17.0   0.0    0.0    movq 8(%rdi,%rdx,8), %rdi
8.     3     22.0   0.0    0.0    testq %rdi, %rdi
9.     3     23.0   0.0    0.0    jne .L4
10.    3     1.0    1.0    18.0   retq
11.    3     1.7    1.7    17.3   retq
       3     12.2   0.3    4.1    <total>
Against the code generated by LLVM
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
      [0]    [1]    [2]    [3]
0.     3     0.0    0.0    11.7   xorl %eax, %eax
1.     3     10.3   0.3    0.7    testq %rdi, %rdi
2.     3     11.0   0.0    0.0    je .LBB0_3
3.     3     0.0    0.0    12.0   xorl %ecx, %ecx
4.     3     10.0   0.3    0.0    cmpl %esi, (%rdi)
5.     3     15.0   0.0    0.0    setl %cl
6.     3     14.7   0.0    0.0    cmovgeq %rdi, %rax
7.     3     15.3   0.0    0.0    movq 8(%rdi,%rcx,8), %rdi
8.     3     20.0   0.0    0.0    testq %rdi, %rdi
9.     3     21.0   0.0    0.0    jne .LBB0_1
10.    3     1.0    1.0    16.0   retq
       3     10.8   0.2    3.7    <total>
We can also see that the resource pressure per iteration is much higher for GCC:
Resources:
[0] - Zn3AGU0
[1] - Zn3AGU1
[2] - Zn3AGU2
[3] - Zn3ALU0
[4] - Zn3ALU1
[5] - Zn3ALU2
[6] - Zn3ALU3
[7] - Zn3BRU1
[14.0] - Zn3LSU
[14.1] - Zn3LSU
[14.2] - Zn3LSU
[15.0] - Zn3Load
[15.1] - Zn3Load
[15.2] - Zn3Load
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7]
1.33 1.33 1.34 3.33 1.35 1.65 2.65 2.02
[14.0] [14.1] [14.2] [15.0] [15.1] [15.2]
1.33 1.33 1.34 1.33 1.33 1.34
Against LLVM
[0] [1] [2] [3] [4] [5] [6] [7]
1.00 1.00 1.00 2.55 0.99 1.01 2.50 1.95
[14.0] [14.1] [14.2] [15.0] [15.1] [15.2]
1.00 1.00 1.00 1.00 1.00 1.00
It looks like LLVM does a much better job of minimizing pipeline pressure.
If you are interested in only certain portions of the execution, such as the inner loop, you can mark the regions to be analyzed, as in:
Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    __asm volatile("# LLVM-MCA-BEGIN foo":::"memory");
    bool x_gte = !(x->key < key);
    lower = x_gte ? x : lower;
    x = x->links[!x_gte];
    __asm volatile("# LLVM-MCA-END foo":::"memory");
  }
  return lower;
}
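A sketch of the workflow (the file name is a placeholder): compile to assembly, so the markers survive as comments in the output, then run llvm-mca on the result; it will report statistics for the region between the markers.

g++ -O2 -S lower_bound.cc -o lower_bound.s
llvm-mca -mcpu=znver3 -timeline lower_bound.s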
This brings total cycles to 1303 for GCC and 1203 for LLVM.
Compiler Explorer: https://godbolt.org/z/8KoKfab34