I'm trying to create a specpoline (cfr. Henry Wong) on my Kabe lake 7600U, I'm running CentOS 7.
The full testing repository is available on GitHub.
My version of the specpoline is as follow (cfr. spec.asm):
specpoline:
;Long dependancy chain
fld1
TIMES 4 f2xm1
fcos
TIMES 4 f2xm1
fcos
TIMES 4 f2xm1
%ifdef ARCH_STORE
mov DWORD [buffer], 241 ;Store in the first line
%endif
add rsp, 8
ret
This version is different from the Henry Wong's one in the way the flow is diverted into architectural path. While the original version used a fixed address I pass the target into the stack.
This way, an add rsp, 8
will remove the original return address and use the artificial one.
In the first part of the function I create a long latency dependency chain using some old FPU instructions, then an independent chain that attempt to fool the CPU return stack predictor.
The specpoline is inserted into a profiling context using a FLUSH+RELOAD1, the same assembly file also contains:
buffer
A continuous buffer spanning 256 distinct cache lines, each one separated by GAP-1
lines for a total of 256*64*GAP
bytes.
The GAP is used to prevent hardware prefetching.
A graphical depiction follows (each index is right after the other).
timings
An array of 256 DWORDs, each entry holds the time, in core cycles, needed to access the corresponding line in the F+R buffer.
flush
A small function to touch each page (with a store, just to be sure COW is on our side) of the F+R buffer and evicts the designated lines.
'profile`
Standard profile function that uses lfence+rdtsc+lence
well to profile a load from each line in the F+R buffer and store the result in the timings
array.
leak
This is the function that does the real work, call the specpoline
putting a store in the speculative path and the profile
function in the architectural path.
;Flush the F+R lines
call flush
;Unaligned stack, don't mind
lea rax, [.profile]
push rax
call specpoline
;O.O 0
; o o o SPECULATIVE PATH
;0.0 O
%ifdef SPEC_STORE
mov DWORD [buffer], 241 ;Just a number
%endif
ud2 ;Stop speculation
.profile:
;Ll Ll
; ! ! ARCHITECTURAL PATH
;Ll Ll
;Fill the timings array
call profile
A small C program is used to "bootstrap" the test harness.
The code use pre-processors conditional to conditionally put a store in the architectural path (actually in the specpoline itself) if ARCH_STORE
is defined and to conditionally put a store in the speculative path if SPEC_STORE
is defined.
Both store access the first line of the F+R buffer.
Running make run_spec
and make run_arch
will assemble spec.asm
with the corresponding symbol, compile the test and run it.
The test shows the timings for each line of the F+R buffer.
Store in the architectural path
38 230 258 250 212 355 230 223 214 212 220 216 206 212 212 234
213 222 216 212 212 210 1279 222 226 301 258 217 208 212 208 212
208 208 208 216 210 212 214 213 211 213 254 216 210 224 211 209
258 212 214 224 220 227 222 224 208 212 212 210 210 224 213 213
207 212 254 224 209 326 225 216 216 224 214 210 208 222 213 236
234 208 210 222 228 223 208 210 220 212 258 223 210 218 210 218
210 218 212 214 208 209 209 225 206 208 206 1385 207 226 220 208
224 212 228 213 209 226 226 210 226 212 228 222 226 214 230 212
230 211 226 218 228 212 234 223 228 216 228 212 224 225 228 226
228 242 268 226 226 229 224 226 224 212 299 216 228 211 226 212
230 216 228 224 228 216 228 218 228 218 227 226 230 222 230 225
228 226 224 218 225 252 238 220 229 1298 228 216 228 208 230 225
226 224 226 210 238 209 234 224 226 255 230 226 230 206 227 209
226 224 228 226 223 246 234 226 227 228 230 216 228 211 238 216
228 222 226 227 226 240 236 225 226 212 226 226 226 223 228 224
228 224 229 214 224 226 224 218 229 238 234 226 225 240 236 210
Store in the speculative path
298 216 212 205 205 1286 206 206 208 251 204 206 206 208 208 208
206 206 230 204 206 208 208 208 210 206 202 208 206 204 256 208
206 208 203 206 206 206 206 206 208 209 209 256 202 204 206 210
252 208 216 206 204 206 252 232 218 208 210 206 206 206 212 206
206 206 206 242 207 209 246 206 206 208 210 208 204 208 206 204
204 204 206 210 206 208 208 232 230 208 204 210 1287 204 238 207
207 211 205 282 202 206 212 208 206 206 204 206 206 210 232 209
205 207 207 211 205 207 209 205 205 211 250 206 208 210 278 242
206 208 204 206 208 204 208 210 206 206 206 206 206 208 204 210
206 206 208 242 206 208 206 208 208 210 210 210 202 232 205 207
209 207 211 209 207 209 212 206 232 208 210 244 204 208 255 208
204 210 206 206 206 1383 209 209 205 209 205 246 206 210 208 208
206 206 204 204 208 246 206 206 204 234 207 244 206 206 208 206
208 206 206 206 206 212 204 208 208 202 208 208 208 208 206 208
250 208 214 206 206 206 206 208 203 279 230 206 206 210 242 209
209 205 211 213 207 207 209 207 207 211 205 203 207 209 209 207
I put a store in the architectural path to test the timings function, it seems to work.
However I'm unable to get the same result with a store in the speculative path.
Why is the CPU not speculatively executing the store?
1 I confess I have never really spent time distinguishing all the cache profiling techniques. I hope I used the right name. By FLUSH+RELOAD I mean the process of evicting a set of lines, speculatively execute some code and then record the time to access each of the evicted line.
Your "long dep chain" is many uops from those microcoded x87 instructions. fcos
is 53-105 uops on SKL, with 50-130 cycle throughput. So it's about 1 cycle per uop delay, and the scheduler / reservation-station (RS) "only" has 97 entries in SKL/KBL. Also, getting later instructions into the out-of-order back end may be a problem, because microcode takes over the front-end and needs some kind of mechanism to decide which uops to issue next, probably depending on the result of some computation. (The number of uops is known to be data-dependent.)
If you want the most delay from an RS full of un-executed uops, a sqrtpd
dependency chain is probably your best bet. e.g.
xorps xmm0,xmm0 ; avoid subnormals that might trigger FP assists
times 40 sqrtsd xmm0, xmm0
; then make the store of the new ret addr dependent on that chain
movd ebx, xmm0
; and ebx, 0 ; not needed, sqrt(0) = 0.0 = integer bit pattern 0
mov [rsp+rbx], rax
ret
Since Nehalem, Intel CPUs have fast recovery for branch misses with a Branch Order Buffer that snapshots the OoO state (including the RAT and probably RS) What exactly happens when a skylake CPU mispredicts a branch?. So they can recover exactly to the mispredict without waiting for the mispredict to be the retirement state.
mov [rsp], rax
could execute as soon as it entered the RS, or at least without a dependency on the sqrt
dep chain. As soon as store-forwarding could produce the value, the ret
uop could execute and check the prediction, and detect the mispredict while the sqrt dep chain was still crunching. (ret
is 1 micro-fused uop for load ports + port 6, where the taken-branch execution unit is.)
Coupling the sqrtsd
dep chain into storing a new return address prevents the ret
from executing early. Executing a ret
uop in an execution port = checking the prediction and detecting the mispredict if there is one.
(Contrast with Meltdown, where the "wrong" path keeps running until the faulting load reaches retirement, and you want it to execute ASAP (just not retire). But you usually want to put a whole Meltdown attack in the shadow of something else, like TSX or a specpoline, in which case you would need something like this and have the whole meltdown in the shadow of this dep chain. Then Meltdown wouldn't need its own sqrtsd
dep chain.)
(vsqrtpd ymm
is still 1 uop on SKL, with worse throughput than xmm, but it has the same latency. So use sqrtsd
because it's the same length and presumably more energy-efficient.)
The best-case latency is 15 cycles vs. worst-case of 16 on SKL/KBL (https://agner.org/optimize), so it hardly matters what input you start with.
I initially used sqrtpd with similar results. However I didn't initialise the XMM register used as input (and output) thinking it didn't matter. I tested again but this time I initialised the register with two doubles of value 1e200 and what I get is an intermittent result. Sometime the line is fetched speculatively sometime it is not.
If XMM0 holds a subnormal (e.g. bit pattern is a small integer), sqrtpd takes a microcode assist. (fp_assist.any
perf counter). Even if the result is normal but the input is subnormal. I tested both cases on SKL, with this loop:
pcmpeqd xmm0,xmm0
psrlq xmm0, 61 ; or 31 for a subnormal input whose sqrt is normalized
addpd xmm0,xmm0 ; avoid domain-crossing vec-int -> vec-fp weirdness
mov ecx, 10000000
.loop:
sqrtpd xmm1, xmm0
dec ecx
jnz .loop
mov eax,1
int 0x80 ; sys_exit
perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,fp_assist.any
shows 1 assist per iteration for denormal inputs, with 951M
uops issued (and ~160 cycles per iteration). So we can conclude a microcode assist for sqrtpd
takes ~95 uops in this case, and has a throughput cost ~160 cycles when it happens back-to-back.
vs. 20M uops issued total for input = NaN (all-ones), with 4.5 cycles per iteration. (The loop runs 10M sqrtpd
uops, and 10M macro-fused dec/jcc uops.)