Naively assuming considered harmful: Prolog predicate with accumulator blows (global) stack, but naive version does not

I have tried a few versions of a simple predicate which pulls random values out of the extra-logical universe and puts them into a list. I assumed that the version with the accumulator would be be tail-call optimized, as nothing happens after recursive call so a path to optimization exists, but it is not (it uses the "global stack"). On the other hand, the "naive version" has apparently been optimized into a loop. This is SWI Prolog.

Why would the accumulator version be impervious to tail-call optimization?

Here are the predicate versions.

Slowest, runs out of local stack space (expectedly)

Here, we just allow a head with function symbols to make things explicit.

% Slowest, and uses 4 inferences per call (+ 1 at the end of recursion). 
% Uses "local stack" indicated in the "Stack limit (1.0Gb) exceeded" 
% error at "Stack depth: 10,321,204":
% "Stack sizes: local: 1.0Gb, global: 7Kb, trail: 1Kb"

oracle_rands_explicit(Out,Size) :- 
   Size>0, !, 
   NewSize is Size-1, 
   oracle_rands_explicit(R,NewSize), 
   X is random_float, 
   Out = [X-Size|R].
oracle_rands_explicit([],0).

?- oracle_rands_explicit(X,4).
X = [0.7717053554954681-4, 0.9110187097066331-3, 0.9500246711335888-2, 0.25987829195170065-1].

?- X = 1000000, time(oracle_rands_explicit(_,X)).
% 4,000,001 inferences, 1.430 CPU in 1.459 seconds (98% CPU, 2797573 Lips)

?- X = 50000000, time(oracle_rands_explicit(_,X)).
ERROR: Stack limit (1.0Gb) exceeded
ERROR:   Stack sizes: local: 1.0Gb, global: 7Kb, trail: 1Kb
ERROR:   Stack depth: 10,321,204, last-call: 0%, Choice points: 6
ERROR:   Possible non-terminating recursion: ...

Faster, and does not run out of stack space

Again, we just allow a head with no function symbols to make things explicit, but we move the recursive call to the end of the body, which apparently makes a difference!

% Same number of inferences as Slowest, i.e. 4 inferences per call
% (+ 1 at the end of recursion), but at HALF the time.
% Does not run out of stack space! Conclusion: this is tail-call-optimized.

oracle_rands_explicit_last_call(Out,Size) :- 
   Size>0, !, 
   NewSize is Size-1,    
   X is random_float, 
   Out = [X-Size|R],
   oracle_rands_explicit_last_call(R,NewSize).
oracle_rands_explicit_last_call([],0).

?- oracle_rands_explicit_last_call(X,4).
X = [0.6450176209046125-4, 0.5605468429780708-3, 0.597052872950385-2, 0.14440970112076815-1].    

?- X = 1000000, time(oracle_rands_explicit_last_call(_,X)).
% 4,000,001 inferences, 0.697 CPU in 0.702 seconds (99% CPU, 5739758 Lips)

?- X = 50000000, time(oracle_rands_explicit_last_call(_,X)).
% 200,000,001 inferences, 32.259 CPU in 32.464 seconds (99% CPU, 6199905 Lips)

Compact, less inferences, and does not run out of stack space

Here we allow function symbols in the head for more compact notation. Still naive recursion.

% Only 3 inferences per call (+ 1 at the end of recursion), but approx % same time as "Faster". % Does not run out of stack space! Conclusion: this is tail-call-optimized.

oracle_rands_compact([X-Size|R],Size) :- 
   Size>0, !, 
   NewSize is Size-1,    
   X is random_float, 
   oracle_rands_compact(R,NewSize).
oracle_rands_compact([],0).

?- oracle_rands_compact(X,4).
X = [0.815764980826608-4, 0.6516093608470418-3, 0.03206964297092248-2, 0.376168614426895-1].

?- X = 1000000, time(oracle_rands_compact(_,X)).
% 3,000,001 inferences, 0.641 CPU in 0.650 seconds (99% CPU, 4678064 Lips)

?- X = 50000000, time(oracle_rands_compact(_,X)).
% 150,000,001 inferences, 29.526 CPU in 29.709 seconds (99% CPU, 5080312 Lips)

Accumulator-based and unexpectedly runs out of (global) stack space

% Accumulator-based, 3 inferences per call (+ 1 at the end of recursion + 1 at ignition),
% but it is often faster than the compact version.
% Uses "global stack" as indicated in the "Stack limit (1.0Gb) exceeded" 
% error at "Stack depth: 12,779,585":
% "Stack sizes: local: 1Kb, global: 0.9Gb, trail: 40.6Mb"

oracle_rands_acc(Out,Size) :- oracle_rands_acc(Size,[],Out).

oracle_rands_acc(Size,ThreadIn,ThreadOut) :- 
   Size>0, !, 
   NewSize is Size-1, 
   X is random_float, 
   oracle_rands_acc(NewSize,[X-Size|ThreadIn],ThreadOut).
oracle_rands_acc(0,ThreadIn,ThreadOut) :-
   reverse(ThreadIn,ThreadOut).

?- oracle_rands_acc(X,4).
X = [0.7768407880604368-4, 0.03425412654687081-3, 0.6392634169514991-2, 0.8340458397587001-1].

?- X = 1000000, time(oracle_rands_acc(_,X)).
% 4,000,004 inferences, 0.798 CPU in 0.810 seconds (99% CPU, 5009599 Lips)

?- X = 50000000, time(oracle_rands_acc(_,X)).
ERROR: Stack limit (1.0Gb) exceeded
ERROR:   Stack sizes: local: 1Kb, global: 0.9Gb, trail: 40.6Mb
ERROR:   Stack depth: 12,779,585, last-call: 100%, Choice points: 6
ERROR:   In:
ERROR:     [12,779,585] user:oracle_rands_acc(37220431, [length:12,779,569], _876)

Addendum: Another version of the "compact" version.

Here we move the Size parameter to the first position, and do not use !. But indexing is a complex matter. Differences will probably be of note with many more clauses only.

oracle_rands_compact2(Size,[X-Size|R]) :- 
   Size>0, 
   NewSize is Size-1,    
   X is random_float, 
   oracle_rands_compact2(NewSize,R).
oracle_rands_compact2(0,[]).

Trying, with L instead of an anonymous variable, and L used after call.

X = 10000000, time(oracle_rands_compact2(X,L)),L=[].
% 30,000,002 inferences, 6.129 CPU in 6.159 seconds (100% CPU, 4894674 Lips)

X = 10000000, time(oracle_rands_compact(L,X)),L=[].
% 30,000,001 inferences, 5.865 CPU in 5.892 seconds (100% CPU, 5115153 Lips)

Maybe marginally faster. The above numbers vary a bit, one would really have to generate a full statistics over a hundred runs or so.

Does re-introducing the cut make it faster (it doesn't seem to make it slower)?

oracle_rands_compact3(Size,[X-Size|R]) :- 
   Size>0, !,
   NewSize is Size-1,    
   X is random_float, 
   oracle_rands_compact3(NewSize,R).
oracle_rands_compact3(0,[]).

?- X = 10000000, time(oracle_rands_compact3(X,L)),L=[].
% 30,000,001 inferences, 5.026 CPU in 5.061 seconds (99% CPU, 5969441 Lips)

Can't say, really.

Solution

It all depends on the top-level shell and the actual interpretation of the _. Try

 ?- X = 50000000, time(oracle_rands_compact(L,X)),L=[].

instead, it will be more or less as bad as the accumulator version which has to produce the entire list first only to hand it over to reverse/2. To see this use

?- set_prolog_flag(trace_gc, true).
true.

?- X = 50000000, time(oracle_rands_compact(_,X)).
% GC: gained 0+0 in 0.001 sec; used 440+8; free 126,520+129,008
% GC: gained 0+0 in 0.000 sec; used 464+16; free 126,496+129,000
% GC: gained 0+0 in 0.000 sec; used 464+16; free 126,496+129,000
...

?- X = 50000000, time(oracle_rands_compact(L,X)),L=[].
% SHIFT: l:g:t = 0:1:0 ...l+g+t = 131072+262144+131072 (0.000 sec)
% GC: gained 0+0 in 0.002 sec; used 123,024+16; free 135,008+129,000
% SHIFT: l:g:t = 0:1:0 ...l+g+t = 131072+524288+131072 (0.000 sec)
% GC: gained 0+0 in 0.003 sec; used 257,976+24; free 262,200+128,992
% SHIFT: l:g:t = 0:0:1 ...l+g+t = 131072+524288+262144 (0.000 sec)
% SHIFT: l:g:t = 0:1:0 ...l+g+t = 131072+1048576+262144 (0.000 sec)
% GC: gained 0+0 in 0.007 sec; used 520,104+16; free 524,360+260,072
...

If we are at it, your _compact version can be accelerated by exchanging the arguments and removing the cut. Classic first argument indexing is capable of handling this situation avoiding any choice point. (SWI has WAM style 1st argument indexing plus a lesser version for multiple arguments, last time I checked)