I have been implementing various node-based binary search trees using C-ish C++ code. When benchmarking these I have noticed surprisingly large performance variations, both across compilers and in response to small code changes.
When I focused on insertion and removal in a tree that allows duplicates (as a C++ std::multiset<int> would), I found that almost all of the time is spent zig-zagging down the tree's left and right pointers in operations like "find" and "lower_bound", rather than in the conceptually "expensive" rebalancing steps that occur after inserts and deletes.
So I began to focus on one case in particular: lower bound.
// Node is a binary tree node. It has the
// usual left and right links and an
// integral key.
struct Node {
  int key;
  Node* links[2];  // links[0] is left, links[1] is right
};

// LowerBound returns the first node in
// the tree rooted at "x" whose key is
// not less than "key", or null if there
// is no such node.
Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    bool x_gte = !(x->key < key);  // x->key >= key?
    lower = x_gte ? x : lower;     // branchless candidate update
    x = x->links[!x_gte];          // descend left if x >= key, else right
  }
  return lower;
}
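To make the semantics concrete, here is a minimal check against a hand-built three-node tree (the tree shape and keys are just for illustration, assuming the Node and LowerBound definitions above):

#include <cassert>
#include <cstdio>

int main() {
  // Hand-built tree holding {10, 20, 30}:
  //       20
  //      /  \
  //    10    30
  Node n10{10, {nullptr, nullptr}};
  Node n30{30, {nullptr, nullptr}};
  Node n20{20, {&n10, &n30}};

  assert(LowerBound(&n20, 15)->key == 20);  // first key not less than 15
  assert(LowerBound(&n20, 10)->key == 10);  // exact match
  assert(LowerBound(&n20, 31) == nullptr);  // everything is less than 31
  std::printf("LowerBound checks passed\n");
  return 0;
}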
A few points and observations:

- Conditional move (cmov) instructions are faster on AMD than on Intel.
- Branchy code, rather than cmov or similar, is often much slower.
- I put an array of links (Node* links[2]) in the node instead of explicit left and right pointers. With explicit fields gcc is very quick to switch to branchy code, which is slower. With the links array gcc will index it as I have written. (A sketch of the explicit-pointer variant is included at the end of this question.)

With the code above on godbolt we can see clang generating the following:
LowerBound(Node*, int):
        xorl    %eax, %eax
        testq   %rdi, %rdi
        je      .LBB0_3
.LBB0_1:                      # =>This Inner Loop Header: Depth=1
        xorl    %ecx, %ecx
        cmpl    %esi, (%rdi)
        setl    %cl
        cmovgeq %rdi, %rax
        movq    8(%rdi,%rcx,8), %rdi
        testq   %rdi, %rdi
        jne     .LBB0_1
.LBB0_3:
        retq
while gcc is doing worse:
LowerBound(Node*, int):
        xorl    %eax, %eax
        testq   %rdi, %rdi
        je      .L5
.L4:
        cmpl    %esi, (%rdi)
        setl    %dl
        cmovge  %rdi, %rax
        movzbl  %dl, %edx
        movq    8(%rdi,%rdx,8), %rdi
        testq   %rdi, %rdi
        jne     .L4
        ret
.L5:
        ret
The gcc variant is roughly 2x slower on my machine (the geomean of the timings with tree heights 1 to 18). Can this be explained in a simple way? I notice that clang clears %ecx first, then sets %cl, then uses %rcx, whereas gcc sets %dl, then zero-extends it into %edx with movzbl, and only then uses %rdx.
gcc's approach is logically equivalent but much slower in practice. Can it be improved?
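For reference, the explicit-pointer variant I mentioned above looks roughly like this (NodeLR and LowerBoundLR are illustrative names, not the exact code I benchmarked); with this layout gcc is quick to compile the descend-left/descend-right choice into a real branch:

// Hypothetical explicit-field layout, for contrast with Node::links[2] above.
struct NodeLR {
  int key;
  NodeLR* left;
  NodeLR* right;
};

NodeLR* LowerBoundLR(NodeLR* x, int key) {
  NodeLR* lower = nullptr;
  while (x != nullptr) {
    if (!(x->key < key)) {  // x->key >= key: remember x, go left
      lower = x;
      x = x->left;
    } else {                // x->key < key: go right
      x = x->right;
    }
  }
  return lower;
}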
Using llvm-mca, a tool from the LLVM suite that statically analyzes machine code for a given target architecture, we can see that there is indeed a difference.
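As a sketch of how such numbers can be gathered (the file name is a placeholder; llvm-mca consumes assembly, not object code), pipe each compiler's assembly output straight into llvm-mca:

# Analyze gcc's and clang's codegen for a given CPU.
g++     -O2 -S -o - lower_bound.cc | llvm-mca -mcpu=skylake -timeline
clang++ -O2 -S -o - lower_bound.cc | llvm-mca -mcpu=skylake -timeline

Swapping -mcpu=skylake for -mcpu=znver3 gives the Zen3 figures.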
For the Intel Skylake architecture, GCC versus LLVM:
Instructions: 1200 vs 1200
Total Cycles: 1305 vs 1205
Total uOps:   1700 vs 1400
For the AMD Zen3 architecture, GCC versus LLVM:
Instructions: 1200 vs 1100
Total Cycles: 1205 vs 1105
Total uOps:   1200 vs 1100
The average wait times for GCC were 20% higher. For the GCC code:
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
      [0]    [1]    [2]    [3]
0.     3     0.0    0.0    12.0   xorl %eax, %eax
1.     3     11.0   0.3    0.7    testq %rdi, %rdi
2.     3     12.0   0.0    0.0    je .L5
3.     3     11.0   0.3    0.0    cmpl %esi, (%rdi)
4.     3     16.0   0.0    0.0    setl %dl
5.     3     17.0   0.0    0.0    movzbl %dl, %edx
6.     3     15.0   0.0    1.0    cmovgeq %rdi, %rax
7.     3     17.0   0.0    0.0    movq 8(%rdi,%rdx,8), %rdi
8.     3     22.0   0.0    0.0    testq %rdi, %rdi
9.     3     23.0   0.0    0.0    jne .L4
10.    3     1.0    1.0    18.0   retq
11.    3     1.7    1.7    17.3   retq
       3     12.2   0.3    4.1    <total>
Against the code generated by LLVM
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
      [0]    [1]    [2]    [3]
0.     3     0.0    0.0    11.7   xorl %eax, %eax
1.     3     10.3   0.3    0.7    testq %rdi, %rdi
2.     3     11.0   0.0    0.0    je .LBB0_3
3.     3     0.0    0.0    12.0   xorl %ecx, %ecx
4.     3     10.0   0.3    0.0    cmpl %esi, (%rdi)
5.     3     15.0   0.0    0.0    setl %cl
6.     3     14.7   0.0    0.0    cmovgeq %rdi, %rax
7.     3     15.3   0.0    0.0    movq 8(%rdi,%rcx,8), %rdi
8.     3     20.0   0.0    0.0    testq %rdi, %rdi
9.     3     21.0   0.0    0.0    jne .LBB0_1
10.    3     1.0    1.0    16.0   retq
       3     10.8   0.2    3.7    <total>
We can also see that the resource pressure per iteration is much higher for GCC:
Resources:
[0] - Zn3AGU0
[1] - Zn3AGU1
[2] - Zn3AGU2
[3] - Zn3ALU0
[4] - Zn3ALU1
[5] - Zn3ALU2
[6] - Zn3ALU3
[7] - Zn3BRU1
[14.0] - Zn3LSU
[14.1] - Zn3LSU
[14.2] - Zn3LSU
[15.0] - Zn3Load
[15.1] - Zn3Load
[15.2] - Zn3Load
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7]
1.33 1.33 1.34 3.33 1.35 1.65 2.65 2.02
[14.0] [14.1] [14.2] [15.0] [15.1] [15.2]
1.33 1.33 1.34 1.33 1.33 1.34
Against LLVM
[0] [1] [2] [3] [4] [5] [6] [7]
1.00 1.00 1.00 2.55 0.99 1.01 2.50 1.95
[14.0] [14.1] [14.2] [15.0] [15.1] [15.2]
1.00 1.00 1.00 1.00 1.00 1.00
It looks like LLVM does a much better job of minimizing pipeline pressure.
If you are interested in only certain portions of the execution, such as the inner loop, you can mark the regions to be analyzed, as in:
Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    __asm volatile("# LLVM-MCA-BEGIN foo":::"memory");
    bool x_gte = !(x->key < key);
    lower = x_gte ? x : lower;
    x = x->links[!x_gte];
    __asm volatile("# LLVM-MCA-END foo":::"memory");
  }
  return lower;
}
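A sketch of the workflow (the file name is a placeholder): compile to assembly, so the markers survive as comments in the output, then run llvm-mca on the result; it will report statistics for the region between the markers.

g++ -O2 -S lower_bound.cc -o lower_bound.s
llvm-mca -mcpu=znver3 -timeline lower_bound.s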
This brings total cycles to 1303 for GCC and 1203 for LLVM.
Compiler Explorer: https://godbolt.org/z/8KoKfab34