Here is a simplified singly-linked list, where each node owns the next, along with a function for destroying the list:
struct Node {
Node* next = nullptr;
~Node() {
delete next;
}
};
void Destroy(Node* head) {
delete head;
}
Clang 15.0.0 with -O3
(Compiler Explorer) gives recursive code for this that uses call
instructions, i.e. no tail call recursion through ~Node
itself, only operator delete(void*)
:
Destroy(Node*): # @Destroy(Node*)
test rdi, rdi
je .LBB0_1
push rbx
mov rbx, rdi
call Node::~Node() [base object destructor]
mov rdi, rbx
pop rbx
jmp operator delete(void*)@PLT # TAILCALL
.LBB0_1:
ret
Node::~Node() [base object destructor]: # @Node::~Node() [base object destructor]
push rbx
mov rbx, qword ptr [rdi]
test rbx, rbx
je .LBB1_1
mov rdi, rbx
call Node::~Node() [base object destructor]
mov rdi, rbx
pop rbx
jmp operator delete(void*)@PLT # TAILCALL
.LBB1_1:
pop rbx
ret
Here is an open-coded version of something similar:
struct Node2 {
Node2* next = nullptr;
};
void Destroy2(Node2* head) {
auto* const next = head->next;
delete head;
if (next) {
Destroy2(next);
}
}
Even with -01
clang turns the tail call into an efficient loop with O(1) stack frames involved for an arbitrary number of list nodes:
Destroy2(Node2*): # @Destroy2(Node2*)
push rbx
.LBB2_1: # =>This Inner Loop Header: Depth=1
mov rbx, qword ptr [rdi]
call operator delete(void*)@PLT
mov rdi, rbx
test rbx, rbx
jne .LBB2_1
pop rbx
ret
I understand that compiler optimizations aren't guaranteed, but I'm surprised clang isn't able to do something more efficient with the basic Destroy
case. It leads me to think that the key difference is in the fact that Destroy2
is able to free the memory for head
before it deals with head->next
. But it seems to me that shouldn't matter unless operator delete
is allowed to have some visible side effect.
Is there an important semantic difference between these two from the point of view of the abstract machine, preventing clang from optimizing the first case? If so, is there a way to make ~Node
more friendly to the optimizer so that I don't need to open-code a destroy function?
It's worth calling out explicitly something I alluded to in the question: that the operations in ~Node
are in the opposite order that they are in Destroy2
. The destructor is something like this:
// Pseudocode
~Node() {
delete next;
free(this);
}
So the question is why the compiler can't reverse those two operations, putting them in the same order as in Destroy2
and allowing a tail call to ~Node
for next
.
I can't say definitively that this is the answer, but Richard Smith points out that this could be because the compiler is limited in its ability to prove that anything preventing this would be UB. For example if there is a cycle in the list then maybe this could go wrong—a cycle would cause UB even for the existing code, but the compiler may not be able to see that. Perhaps more relevantly without an assumption about finite memory/object count (that the compiler probably doesn't make), the transformation isn't legal because the list may be unbounded without UB in the existing code.
As for what to do about it, Richard also points out that in C++20 we can use a destroying operator delete
to reorder the operations ourselves, giving the effect of Destroy2
with the convenience of just using delete
like normal:
struct Node {
Node* next = nullptr;
void operator delete(Node *p, std::destroying_delete_t) {
Node *next = std::exchange(p->next, nullptr);
::delete p;
delete next;
}
~Node() {
delete next;
}
};
As of today the generated code is good for raw pointers and bad for std::unique_ptr
, but the latter gets better if you add [[gnu::flatten]]
to the operator.