c++clang tail-recursion tail-call-optimization tail-call

Why does clang have trouble with optimizing tail calls in destructors?

Here is a simplified singly-linked list, where each node owns the next, along with a function for destroying the list:

struct Node {
  Node* next = nullptr;

  ~Node() {
    delete next;
  }
};

void Destroy(Node* head) {
    delete head;
}

Clang 15.0.0 with -O3 (Compiler Explorer) gives recursive code for this that uses call instructions, i.e. no tail call recursion through ~Node itself, only operator delete(void*):

Destroy(Node*):                       # @Destroy(Node*)
        test    rdi, rdi
        je      .LBB0_1
        push    rbx
        mov     rbx, rdi
        call    Node::~Node() [base object destructor]
        mov     rdi, rbx
        pop     rbx
        jmp     operator delete(void*)@PLT                      # TAILCALL
.LBB0_1:
        ret
Node::~Node() [base object destructor]:                           # @Node::~Node() [base object destructor]
        push    rbx
        mov     rbx, qword ptr [rdi]
        test    rbx, rbx
        je      .LBB1_1
        mov     rdi, rbx
        call    Node::~Node() [base object destructor]
        mov     rdi, rbx
        pop     rbx
        jmp     operator delete(void*)@PLT                      # TAILCALL
.LBB1_1:
        pop     rbx
        ret

Here is an open-coded version of something similar:

struct Node2 {
  Node2* next = nullptr;
};

void Destroy2(Node2* head) {
    auto* const next = head->next;
    delete head;

    if (next) {
      Destroy2(next);
    }
}

Even with -01 clang turns the tail call into an efficient loop with O(1) stack frames involved for an arbitrary number of list nodes:

Destroy2(Node2*):                     # @Destroy2(Node2*)
        push    rbx
.LBB2_1:                                # =>This Inner Loop Header: Depth=1
        mov     rbx, qword ptr [rdi]
        call    operator delete(void*)@PLT
        mov     rdi, rbx
        test    rbx, rbx
        jne     .LBB2_1
        pop     rbx
        ret

I understand that compiler optimizations aren't guaranteed, but I'm surprised clang isn't able to do something more efficient with the basic Destroy case. It leads me to think that the key difference is in the fact that Destroy2 is able to free the memory for head before it deals with head->next. But it seems to me that shouldn't matter unless operator delete is allowed to have some visible side effect.

Is there an important semantic difference between these two from the point of view of the abstract machine, preventing clang from optimizing the first case? If so, is there a way to make ~Node more friendly to the optimizer so that I don't need to open-code a destroy function?

Solution

It's worth calling out explicitly something I alluded to in the question: that the operations in ~Node are in the opposite order that they are in Destroy2. The destructor is something like this:

// Pseudocode
~Node() {
  delete next;
  free(this);
}

So the question is why the compiler can't reverse those two operations, putting them in the same order as in Destroy2 and allowing a tail call to ~Node for next.

I can't say definitively that this is the answer, but Richard Smith points out that this could be because the compiler is limited in its ability to prove that anything preventing this would be UB. For example if there is a cycle in the list then maybe this could go wrong—a cycle would cause UB even for the existing code, but the compiler may not be able to see that. Perhaps more relevantly without an assumption about finite memory/object count (that the compiler probably doesn't make), the transformation isn't legal because the list may be unbounded without UB in the existing code.

As for what to do about it, Richard also points out that in C++20 we can use a destroying operator delete to reorder the operations ourselves, giving the effect of Destroy2 with the convenience of just using delete like normal:

struct Node {
  Node* next = nullptr;

  void operator delete(Node *p, std::destroying_delete_t) {
      Node *next = std::exchange(p->next, nullptr);
      ::delete p;
      delete next;
  }

  ~Node() {
    delete next;
  }
};

As of today the generated code is good for raw pointers and bad for std::unique_ptr, but the latter gets better if you add [[gnu::flatten]] to the operator.