assembly x86-64 intel spectre

Is it possible to temporarily suppress Intel CET for a single ret instruction, or otherwise use retpolines with it?


Intel CET (control-flow enforcement technology) consists of two pieces: SS (shadow stack) and IBT (indirect branch tracking). If you need to indirectly branch to somewhere that you can't put an endbr64 for some reason, you can suppress IBT for a single jmp or call instruction with notrack. Is there an equivalent way to suppress SS for a single ret instruction?

For context, I'm thinking about how this will interact with retpolines, whose key control flow goes more or less like push real_target; call retpoline; pop junk; ret. If there's no way to suppress SS for that ret, is there some other way for retpolines to work when CET is enabled? If not, what options will we have? Will we need to maintain two sets of binary packages for everything, one for old CPUs that need retpolines and one for new CPUs that support CET? And what if Intel turns out to be wrong, and we end up still needing retpolines on their new CPUs? Will we have to abandon CET to use them?


Solution

  • After playing with the assembly for a bit, I discovered that you can use retpolines with CET, but it's less than ideal. Here's how. For reference, consider this C code:

    extern void (*fp)(void);
    
    int f(void) {
        fp();
        return 0;
    }
    

    Compiling it with gcc -mindirect-branch=thunk -mfunction-return=thunk -O3 yields this:

    f:
            subq    $8, %rsp
            movq    fp(%rip), %rax
            call    __x86_indirect_thunk_rax
            xorl    %eax, %eax
            addq    $8, %rsp
            jmp     __x86_return_thunk
    __x86_return_thunk:
            call    .LIND1
    .LIND0:
            pause
            lfence
            jmp     .LIND0
    .LIND1:
            lea     8(%rsp), %rsp
            ret
    __x86_indirect_thunk_rax:
            call    .LIND3
    .LIND2:
            pause
            lfence
            jmp     .LIND2
    .LIND3:
            mov     %rax, (%rsp)
            ret
    

    It turns out you can make this work just by modifying the thunks to look like this:

    __x86_return_thunk:
            call    .LIND1
    .LIND0:
            pause
            lfence
            jmp     .LIND0
    .LIND1:
            push    %rdi
            movl    $1, %edi
            incsspq %rdi
            pop     %rdi
            lea     8(%rsp), %rsp
            ret
    
    __x86_indirect_thunk_rax:
            call    .LIND3
    .LIND2:
            pause
            lfence
            jmp     .LIND2
    .LIND3:
            push    %rdi
            rdsspq  %rdi
            wrssq   %rax, (%rdi)
            pop     %rdi
            mov     %rax, (%rsp)
            ret
    

    By using the incsspq, rdsspq, and wrssq instructions, you can modify the shadow stack to match your changes to the real stack. I tested those modified thunks with Intel SDE, and they indeed made the control flow errors go away.
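
    To make the bookkeeping concrete, here is a toy C model of the two stacks. It is only a sketch: the struct, the function names, and the placeholder addresses LIND0 and LIND2 are made up for illustration, and it models nothing beyond "call pushes to both stacks, ret compares them".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model (hypothetical names): each stack holds return addresses only. */
enum { DEPTH = 16, LIND0 = 0xA0, LIND2 = 0xA2 }; /* placeholder addresses */

struct cpu {
    uint64_t stack[DEPTH];  /* ordinary stack */
    uint64_t shadow[DEPTH]; /* shadow stack */
    size_t sp, ssp;         /* number of live entries on each stack */
};

/* call: push the return address on both stacks. */
static void do_call(struct cpu *c, uint64_t ret_addr) {
    assert(c->sp < DEPTH && c->ssp < DEPTH);
    c->stack[c->sp++] = ret_addr;
    c->shadow[c->ssp++] = ret_addr;
}

/* ret: pop both stacks; returning false models the control-protection fault. */
static bool do_ret(struct cpu *c, uint64_t *target) {
    assert(c->sp > 0 && c->ssp > 0);
    uint64_t a = c->stack[--c->sp];
    uint64_t s = c->shadow[--c->ssp];
    *target = a;
    return a == s;
}

/* __x86_indirect_thunk_rax: call .LIND3; [rdsspq + wrssq]; mov %rax,(%rsp); ret */
bool indirect_thunk(struct cpu *c, uint64_t rax, bool fix_shadow, uint64_t *target) {
    do_call(c, LIND2);
    if (fix_shadow)
        c->shadow[c->ssp - 1] = rax; /* wrssq %rax, (shadow stack top) */
    c->stack[c->sp - 1] = rax;       /* mov %rax, (%rsp) */
    return do_ret(c, target);
}

/* __x86_return_thunk: call .LIND1; [incsspq 1]; lea 8(%rsp),%rsp; ret */
bool return_thunk(struct cpu *c, bool fix_shadow, uint64_t *target) {
    do_call(c, LIND0);
    if (fix_shadow)
        c->ssp--;                    /* incsspq with a count of 1 */
    c->sp--;                         /* lea 8(%rsp), %rsp */
    return do_ret(c, target);
}
```

    In this model, after one outer do_call (the caller's own call), both thunks report a mismatch with fix_shadow set to false and a match with it set to true. That mirrors what I saw under SDE, though the model says nothing about real hardware behavior.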

    That was the good news. Here's the bad news:

    1. Unlike endbr64, the incsspq and wrssq instructions used in the thunks aren't NOPs on CPUs that don't support CET; they raise an invalid-opcode exception, which shows up as SIGILL (rdsspq, by contrast, is documented to behave as a NOP there). This means you'd need two different sets of thunks, and you'd need to use CPU dispatch to pick the right ones depending on whether CET is available.
    2. Using retpolines at all means that you're no longer executing any indirect jmp or call instructions, so while you'll still get the benefit of SS, you've completely negated IBT. I suppose you could work around this by making __x86_indirect_thunk_rax check that the target starts with an endbr64 instruction, but that's really inelegant and would probably be really slow.
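
    For what it's worth, both caveats can be sketched in C. This is an untested-on-hardware sketch, assuming GCC or Clang on x86-64. The function names and the thunk_kind enum are hypothetical. The rdsspq probe relies on that instruction acting as a NOP when shadow stacks are off, and it is emitted as raw bytes so that older assemblers accept it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <cpuid.h>
#endif

/* Does the CPU advertise shadow stacks? CPUID.(EAX=7,ECX=0):ECX bit 7 = CET_SS. */
bool cpu_has_cet_ss(void) {
#if defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx >> 7) & 1u;
#else
    return false;
#endif
}

/* Is a shadow stack active in this thread? rdsspq leaves the register
 * untouched when shadow stacks are off, so a zero result means "inactive".
 * Note: this does not tell you whether wrssq is permitted, which is a
 * separate setting. */
bool shadow_stack_active(void) {
#if defined(__x86_64__)
    uint64_t ssp = 0;
    __asm__ volatile(".byte 0xf3, 0x48, 0x0f, 0x1e, 0xc8" /* rdsspq %rax */
                     : "+a"(ssp));
    return ssp != 0;
#else
    return false;
#endif
}

/* Caveat 1: pick a thunk set at startup (hypothetical enum). */
enum thunk_kind { THUNKS_PLAIN, THUNKS_CET };

enum thunk_kind pick_thunks(bool ss_active) {
    return ss_active ? THUNKS_CET : THUNKS_PLAIN;
}

/* Caveat 2: software stand-in for IBT, checking that the target begins with
 * endbr64 (F3 0F 1E FA). Assumes the target's code bytes are readable. */
bool starts_with_endbr64(const void *target) {
    static const unsigned char endbr64[4] = {0xf3, 0x0f, 0x1e, 0xfa};
    assert(target != NULL);
    return memcmp(target, endbr64, sizeof endbr64) == 0;
}
```

    A loader or startup routine would call pick_thunks(shadow_stack_active()) once and patch or select the thunks accordingly. The endbr64 check would have to run on every indirect call, which is where I'd expect the slowdown.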