Is it possible to temporarily suppress Intel CET for a single ret instruction, or otherwise use retpolines with it?

Intel CET (control-flow enforcement technology) consists of two pieces: SS (shadow stack) and IBT (indirect branch tracking). If you need to indirectly branch to somewhere that you can't put an endbr64 for some reason, you can suppress IBT for a single jmp or call instruction with notrack. Is there an equivalent way to suppress SS for a single ret instruction?

For context, I'm thinking about how this will interact with retpolines, whose key control flow goes more or less like push real_target; call retpoline; pop junk; ret. If there's no way to suppress SS for that ret, then is there some other way for retpolines to work when CET is enabled? If not, what options will we have? Will we need to maintain two sets of binary packages for everything, one for old CPUs that need retpolines and one for new CPUs that support CET? And what if Intel turns out to be wrong, and we do end up still needing retpolines on their new CPUs? Will we have to abandon CET to use them?

Solution

After playing with the assembly for a bit, I discovered that you can use retpolines with CET, but it's less than ideal. Here's how. For reference, consider this C code:

extern void (*fp)(void);

int f(void) {
    fp();
    return 0;
}

Compiling it with gcc -mindirect-branch=thunk -mfunction-return=thunk -O3 yields this:

f:
        subq    $8, %rsp
        movq    fp(%rip), %rax
        call    __x86_indirect_thunk_rax
        xorl    %eax, %eax
        addq    $8, %rsp
        jmp     __x86_return_thunk
__x86_return_thunk:
        call    .LIND1
.LIND0:
        pause
        lfence
        jmp     .LIND0
.LIND1:
        lea     8(%rsp), %rsp
        ret
__x86_indirect_thunk_rax:
        call    .LIND3
.LIND2:
        pause
        lfence
        jmp     .LIND2
.LIND3:
        mov     %rax, (%rsp)
        ret

It turns out you can make this work just by modifying the thunks to look like this:

__x86_return_thunk:
        call    .LIND1
.LIND0:
        pause
        lfence
        jmp     .LIND0
.LIND1:
        push    %rdi
        movl    $1, %edi
        incsspq %rdi
        pop     %rdi
        lea     8(%rsp), %rsp
        ret

__x86_indirect_thunk_rax:
        call    .LIND3
.LIND2:
        pause
        lfence
        jmp     .LIND2
.LIND3:
        push    %rdi
        rdsspq  %rdi
        wrssq   %rax, (%rdi)
        pop     %rdi
        mov     %rax, (%rsp)
        ret

By using the incsspq, rdsspq, and wrssq instructions, you can modify the shadow stack to match your changes to the real stack. I tested those modified thunks with Intel SDE, and they indeed made the control flow errors go away.
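To make the reasoning concrete, here is a toy model in C of why the stock thunks fault and the modified ones don't. It is only a sketch: the cpu_t struct, the helper names, and the LIND0/LIND2 constants are invented for illustration, the push/pop of %rdi is left out, and a mismatch is modeled as a false return value rather than a real control-protection fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: two parallel stacks of return addresses. */
enum { DEPTH = 16 };
enum { LIND0 = 0xdead0000, LIND2 = 0xdead0002 }; /* stand-ins for the labels */

typedef struct {
    uint64_t stack[DEPTH];  /* regular stack (return addresses only) */
    uint64_t shadow[DEPTH]; /* shadow stack */
    size_t sp, ssp;         /* number of live entries on each */
} cpu_t;

/* call: push the return address on both stacks. */
void do_call(cpu_t *c, uint64_t ret_addr) {
    assert(c->sp < DEPTH && c->ssp < DEPTH);
    c->stack[c->sp++] = ret_addr;
    c->shadow[c->ssp++] = ret_addr;
}

/* ret: pop both stacks; false models the fault raised on a mismatch. */
bool do_ret(cpu_t *c, uint64_t *target) {
    assert(c->sp > 0 && c->ssp > 0);
    uint64_t addr = c->stack[--c->sp];
    uint64_t shadow_addr = c->shadow[--c->ssp];
    *target = addr;
    return addr == shadow_addr;
}

/* __x86_indirect_thunk_rax; fix_shadow selects the modified version. */
bool indirect_thunk(cpu_t *c, uint64_t rax, bool fix_shadow, uint64_t *target) {
    do_call(c, LIND2);                  /* call .LIND3 */
    if (fix_shadow)
        c->shadow[c->ssp - 1] = rax;    /* rdsspq + wrssq %rax, (ssp) */
    c->stack[c->sp - 1] = rax;          /* mov %rax, (%rsp) */
    return do_ret(c, target);           /* ret */
}

/* __x86_return_thunk; fix_shadow selects the modified version. */
bool return_thunk(cpu_t *c, bool fix_shadow, uint64_t *target) {
    do_call(c, LIND0);                  /* call .LIND1 */
    if (fix_shadow)
        c->ssp--;                       /* incsspq with a count of 1 */
    c->sp--;                            /* lea 8(%rsp), %rsp */
    return do_ret(c, target);           /* ret */
}
```

In this model, calling either thunk with fix_shadow set to false makes do_ret report a mismatch, because the regular stack was edited and the shadow stack was not. With fix_shadow set to true the two stacks agree again, which matches what I saw under SDE.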

That was the good news. Here's the bad news:

Unlike endbr64, most of the CET instructions I used in the thunks aren't NOPs on CPUs that don't support CET: rdsspq is, but the modified thunks as a whole still fail with SIGILL there. This means you'd need two different sets of thunks, and you'd need to use CPU dispatch to pick the right ones depending on whether CET is available.
Using retpolines at all means that you're no longer doing any indirect branches: every indirect jmp or call becomes a ret, and IBT doesn't check ret targets. So while you'll still get the benefit of SS, you've completely negated IBT. I suppose you could work around this by making __x86_indirect_thunk_rax check that the target starts with an endbr64 instruction, but that's really inelegant and would probably be really slow.
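Both points can be sketched in C. This is a rough illustration, not a tested implementation: the thunk_set enum and all function names are hypothetical, and the probe assumes that rdsspq leaves its operand untouched when no shadow stack is active (my understanding of the documented behavior), so a register that was zeroed beforehand stays zero. The endbr64 check simply compares the first four bytes of the target against the instruction's encoding, F3 0F 1E FA, and assumes the code bytes are readable.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the two thunk sets; not from any real toolchain. */
typedef enum { THUNKS_PLAIN, THUNKS_SHSTK_AWARE } thunk_set;

/* Read the shadow stack pointer, or 0 if no shadow stack is active.
 * Assumption: rdsspq acts as a NOP when shadow stacks are off, so the
 * zero we start with survives. On other targets we just report 0. */
uint64_t read_ssp(void) {
    uint64_t ssp = 0;
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile("rdsspq %0" : "+r"(ssp));
#endif
    return ssp;
}

/* Pure decision step, kept separate from the probe. */
thunk_set choose_thunks(uint64_t ssp) {
    return ssp != 0 ? THUNKS_SHSTK_AWARE : THUNKS_PLAIN;
}

/* What a startup routine or resolver might call once. */
thunk_set thunks_for_this_process(void) {
    return choose_thunks(read_ssp());
}

/* Software stand-in for the IBT check: does the target begin with endbr64? */
bool starts_with_endbr64(const void *target) {
    static const unsigned char endbr64[4] = {0xf3, 0x0f, 0x1e, 0xfa};
    return memcmp(target, endbr64, sizeof endbr64) == 0;
}
```

A nonzero shadow stack pointer shows only that a shadow stack is active; whether the OS also permits wrssq is a separate question that this sketch doesn't answer. The endbr64 comparison would also have to run on every indirect call, which is where the slowness would come from.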