Can SIPI be sent from a BSP running in long mode?

Currently I have a multi-processing operating system running in x86 protected mode, and I want to make it run in x86_64 long mode. Its current logic to wake up the APs is to send an INIT-SIPI-SIPI sequence:

// BSP already entered protected mode, set up page tables
volatile uint32_t *icr = (volatile uint32_t *)0xfee00300;  // ICR low dword; volatile so MMIO accesses are not optimized away
*icr = 0x000c4500ul;        // send INIT
delay_us(10000);            // delay for 10 ms
while (*icr & 0x1000);      // wait until Send Pending bit is clear
for (int i = 0; i < 2; i++) {
    *icr = 0x000c4610ul;    // send SIPI
    delay_us(200);          // delay for 200 us
    while (*icr & 0x1000);  // wait until Send Pending bit is clear
}

This program works well in 32-bit protected mode.

However, after I modified the operating system to run in 64-bit long mode, the logic breaks when sending SIPI. In QEMU, immediately after executing the send SIPI line, the BSP is reset (program counter goes to 0xfff0).

In Intel's software developer's manual volume 3, section 8.4.4.1 (Typical BSP Initialization Sequence), it says that BSP should "Switches to protected mode". Does this process apply to long mode? How should I debug this problem?

Here is some debug information, in case it helps:

CPU registers before sending SIPI instruction (movl $0xc4610,(%rax)) in 64-bit long mode:

rax            0xfee00300          4276093696
rbx            0x40                64
rcx            0x0                 0
rdx            0x61                97
rsi            0x61                97
rdi            0x0                 0
rbp            0x1996ff78          0x1996ff78
rsp            0x1996ff38          0x1996ff38
r8             0x1996ff28          429326120
r9             0x2                 2
r10            0x0                 0
r11            0x0                 0
r12            0x0                 0
r13            0x0                 0
r14            0x0                 0
r15            0x0                 0
rip            0x1020d615          0x1020d615
eflags         0x97                [ IOPL=0 SF AF PF CF ]
cs             0x10                16
ss             0x18                24
ds             0x18                24
es             0x18                24
fs             0x18                24
gs             0x18                24
fs_base        0x0                 0
gs_base        0x0                 0
k_gs_base      0x0                 0
cr0            0x80000011          [ PG ET PE ]
cr2            0x0                 0
cr3            0x19948000          [ PDBR=12 PCID=0 ]
cr4            0x20                [ PAE ]
cr8            0x0                 0
efer           0x500               [ LMA LME ]
mxcsr          0x1f80              [ IM DM ZM OM UM PM ]

CPU registers before sending SIPI instruction (movl $0xc4610,(%eax)) in 32-bit protected mode:

rax            0xfee00300          4276093696
rbx            0x40000             262144
rcx            0x0                 0
rdx            0x61                97
rsi            0x2                 2
rdi            0x102110eb          270602475
rbp            0x19968f4c          0x19968f4c
rsp            0x19968f04          0x19968f04
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x0                 0
r12            0x0                 0
r13            0x0                 0
r14            0x0                 0
r15            0x0                 0
rip            0x1020d075          0x1020d075
eflags         0x97                [ IOPL=0 SF AF PF CF ]
cs             0x8                 8
ss             0x10                16
ds             0x10                16
es             0x10                16
fs             0x10                16
gs             0x10                16
fs_base        0x0                 0
gs_base        0x0                 0
k_gs_base      0x0                 0
cr0            0x80000015          [ PG ET EM PE ]
cr2            0x0                 0
cr3            0x19942000          [ PDBR=12 PCID=0 ]
cr4            0x30                [ PAE PSE ]
cr8            0x0                 0
efer           0x0                 [ ]
mxcsr          0x1f80              [ IM DM ZM OM UM PM ]

Solution

Can SIPI be sent from a BSP running in long mode?

Yes. The only thing that matters is that you write the right values to the right local APIC registers (with the right delays, sort of - see my method at the end).

However, after I modified the operating system to run in 64-bit long mode, the logic breaks when sending SIPI. In QEMU, immediately after executing the send SIPI line, the BSP is reset (program counter goes to 0xfff0).

I'd assume that either:

a) there's a bug and the address of the local APIC's registers isn't right; causing a triple fault when you attempt to write to the local APIC's register. Don't forget that long mode must use paging, and even though 0xFEE00300 is likely to be the correct physical address it can be the wrong virtual address (unless you took care of that by identity mapping that specific page when porting the OS to long mode).

b) The data isn't right for some hard to imagine reason, causing the SIPI to restart the BSP.
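To check case (a) in a debugger, it can help to know which page table entries the address 0xFEE00300 goes through. The following is a minimal sketch, assuming 4-level paging with 4 KiB pages (the CR4 dump in the question shows only PAE set, so 5-level paging is not in use). The names pt_indices and pt_indices_for are made up for this example.

```c
#include <stdint.h>

/* Indices into each paging level for one virtual address
 * (hypothetical helper type, 4-level paging assumed). */
struct pt_indices {
    uint16_t pml4;    /* bits 39-47 */
    uint16_t pdpt;    /* bits 30-38 */
    uint16_t pd;      /* bits 21-29 */
    uint16_t pt;      /* bits 12-20 */
    uint16_t offset;  /* bits 0-11  */
};

struct pt_indices pt_indices_for(uint64_t vaddr)
{
    struct pt_indices ix;
    ix.pml4   = (uint16_t)((vaddr >> 39) & 0x1FFu);
    ix.pdpt   = (uint16_t)((vaddr >> 30) & 0x1FFu);
    ix.pd     = (uint16_t)((vaddr >> 21) & 0x1FFu);
    ix.pt     = (uint16_t)((vaddr >> 12) & 0x1FFu);
    ix.offset = (uint16_t)(vaddr & 0xFFFu);
    return ix;
}
```

For 0xFEE00300 this gives PML4 entry 0, PDPT entry 3, PD entry 503 and PT entry 0. If the mapping uses 2 MiB pages the walk ends at the PD entry, and with 1 GiB pages it ends at the PDPT entry. For an identity mapping, the entry where the walk ends must be present and writable and must point at physical 0xFEE00000 (or at the large page that contains it).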

In Intel's software developer's manual volume 3, section 8.4.4.1 (Typical BSP Initialization Sequence), it says that BSP should "Switches to protected mode". Does this process apply to long mode?

Intel's "Typical BSP Initialization Sequence" is just one possible example that's only intended for firmware developers. Note that "intended for firmware developers" means that it should not be used by any OS.

The main problem with Intel's example is that it broadcasts the INIT-SIPI-SIPI sequence to all other CPUs (possibly including CPUs that the firmware disabled because they're faulty, and possibly including CPUs that the firmware disabled for other reasons - e.g. because the user disabled hyper-threading); and it fails to detect "CPU exists but failed to start for some reason" (which an OS should report to the user).
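The values in the question (0x000c4500 and 0x000c4610) have the "destination shorthand" field set to "all excluding self", which is the broadcast described above. As a rough sketch of the alternative, assuming xAPIC mode (where the destination APIC ID goes in bits 24-31 of the ICR high dword at offset 0x310, and writing the low dword at offset 0x300 sends the IPI), the values for one specific CPU could be built like this. All macro and function names here are invented for illustration.

```c
#include <stdint.h>

#define ICR_MODE_INIT       (5u << 8)   /* delivery mode: INIT */
#define ICR_MODE_STARTUP    (6u << 8)   /* delivery mode: Start-Up (SIPI) */
#define ICR_LEVEL_ASSERT    (1u << 14)
#define ICR_SHORTHAND_MASK  (3u << 18)
#define ICR_ALL_EXCL_SELF   (3u << 18)  /* shorthand: all excluding self */

/* Value for the ICR high dword: destination APIC ID in bits 24-31. */
uint32_t icr_high_for(uint8_t apic_id)
{
    return (uint32_t)apic_id << 24;
}

/* ICR low dword for an INIT aimed at the CPU named in ICR high. */
uint32_t icr_low_init(void)
{
    return ICR_MODE_INIT | ICR_LEVEL_ASSERT;
}

/* ICR low dword for a SIPI aimed at the CPU named in ICR high. */
uint32_t icr_low_sipi(uint8_t vector)
{
    return ICR_MODE_STARTUP | ICR_LEVEL_ASSERT | vector;
}

/* Non-zero if an ICR low value uses the "all excluding self" shorthand. */
int icr_is_broadcast(uint32_t icr_low)
{
    return (icr_low & ICR_SHORTHAND_MASK) == ICR_ALL_EXCL_SELF;
}

/* Physical address where the AP starts executing for a SIPI vector. */
uint32_t sipi_start_address(uint8_t vector)
{
    return (uint32_t)vector << 12;
}
```

With these helpers, the BSP would write icr_high_for(id) to the high dword first and then icr_low_init() or icr_low_sipi(0x10) to the low dword. The vector 0x10 used in the question corresponds to a trampoline at physical address 0x10000. In x2APIC mode the ICR is a single 64-bit MSR instead, so this layout does not apply there.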

The other problem is that typically an OS will want to pre-allocate a stack for each AP before starting it (and store "address you should use for stack" somewhere before starting an AP), and you can't give each AP its own stack like that if you're starting an unknown number of CPUs at the same time.
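A minimal sketch of that pre-allocation, assuming a fixed pool of stacks and a single "mailbox" variable that the AP trampoline reads. The sizes and the names ap_stacks, ap_boot_stack, ap_stack_top and ap_prepare_stack are placeholders chosen for this example, not anything from a real kernel.

```c
#include <stdint.h>

#define AP_STACK_SIZE 16384u   /* assumed stack size per AP */
#define MAX_APS       8u       /* assumed upper bound on APs */

/* One stack per AP, allocated before any AP is started. */
static _Alignas(16) uint8_t ap_stacks[MAX_APS][AP_STACK_SIZE];

/* Mailbox: the BSP stores the stack top here before starting one AP,
 * and the AP's startup code loads its stack pointer from it. */
volatile uintptr_t ap_boot_stack;

/* Stacks grow downwards on x86, so the AP needs the END of its stack.
 * Returns 0 if the index is out of range. */
uintptr_t ap_stack_top(unsigned index)
{
    if (index >= MAX_APS)
        return 0;
    return (uintptr_t)&ap_stacks[index][0] + AP_STACK_SIZE;
}

/* Publish the stack for the AP about to be started.
 * Returns 0 on success, -1 if there is no stack for this index. */
int ap_prepare_stack(unsigned index)
{
    uintptr_t top = ap_stack_top(index);
    if (top == 0)
        return -1;
    ap_boot_stack = top;
    return 0;
}
```

A single mailbox like this only works if one AP is started at a time, which is exactly why broadcasting to an unknown number of CPUs gets in the way.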

Essentially, firmware uses (something like) the example Intel described, then builds information in an "ACPI/MADT" ACPI table (and/or a "MultiProcessor specification table" for very old computers - it's obsolete now) for the OS to use. The OS uses information from the firmware's table/s to find the physical address of the local APIC in a correct (vendor and platform neutral) way, to find only the CPUs that the firmware says are valid, and to determine if those CPU/s are using "local APIC" or "x2APIC" (which supports more than 256 APIC IDs and is necessary if there's a huge number of CPUs). It then starts only the valid CPUs, one at a time, using a time-out so that "CPU #123, that I have proof exists, has failed to start" can be reported to the user and/or logged.
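As an illustration of "find only the CPUs that the firmware says are valid", here is a small sketch that walks the MADT's entries and collects the APIC IDs of enabled "Processor Local APIC" entries (type 0). It assumes the table has already been located and mapped, that its checksum was verified elsewhere, and that the host is little-endian (true on x86). It ignores x2APIC entries (type 9), which a complete implementation would also handle. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 36-byte SDT header + 4-byte local APIC address + 4-byte flags */
#define MADT_ENTRIES_OFFSET  44u
#define MADT_TYPE_LOCAL_APIC 0u
#define MADT_LAPIC_ENABLED   0x1u

/* Stores the APIC IDs of enabled type-0 entries in ids[] and returns
 * how many were stored (at most max_ids). */
size_t madt_enabled_apic_ids(const uint8_t *madt, uint8_t *ids, size_t max_ids)
{
    uint32_t table_len;
    memcpy(&table_len, madt + 4, sizeof table_len);  /* header "Length" field */

    size_t count = 0;
    size_t off = MADT_ENTRIES_OFFSET;
    while (off + 2 <= table_len) {
        uint8_t type = madt[off];
        uint8_t len  = madt[off + 1];
        if (len < 2 || off + len > table_len)
            break;                                   /* malformed entry: stop */
        if (type == MADT_TYPE_LOCAL_APIC && len >= 8) {
            uint32_t flags;
            memcpy(&flags, madt + off + 4, sizeof flags);
            if ((flags & MADT_LAPIC_ENABLED) && count < max_ids)
                ids[count++] = madt[off + 3];        /* byte 3 = APIC ID */
        }
        off += len;                                  /* skip to next entry */
    }
    return count;
}
```

The local APIC's physical address is the 32-bit field at offset 36 of the same table (a type-5 "address override" entry can replace it with a 64-bit address), which is the vendor neutral way to find it instead of hard-coding 0xFEE00000.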

I should also point out that Intel's example has existed in Intel's manuals mostly unchanged for about 25 years (since before long mode was introduced).

My Method

The delays in Intel's algorithm are annoying, and often a CPU will start on the first SIPI, and sometimes the second SIPI will cause the same CPU to be started twice (causing problems if you have any kind of "started_CPUs++;" in the AP startup code).

To fix these problems (and improve performance) the AP startup code can set an "I started" flag, and instead of having a "delay_us(200);" after sending the first SIPI, the BSP can monitor the "I started" flag with a time-out, and skip the second SIPI (and the remainder of the time-out) if the AP already started. In this case the time-out between SIPIs can be longer (e.g. 500 us is fine) and more importantly needn't be so precise; and the same "wait for flag with time-out" code can be re-used after sending the second SIPI (if the second SIPI needed to be sent), with a much longer time-out.
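The BSP side of this could look roughly like the sketch below. It is only an outline under a few assumptions: the INIT and its delay have already been done, the ICR destination has already been set up, and the time-outs are expressed as poll counts to keep the example self-contained (a real kernel would use a timer for them). The function names are invented for this example.

```c
#include <stdint.h>

/* Polls the flag up to max_polls times; returns 1 as soon as it is set. */
static int flag_set_within(const volatile uint32_t *flag, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (*flag)
            return 1;
    }
    return 0;
}

/* Sends one SIPI, waits for the AP's "I started" flag, and only sends
 * the second SIPI if the first wait timed out.
 * Returns the number of SIPIs that were needed (1 or 2), or 0 if the
 * AP never set the flag. */
int sipi_until_started(volatile uint32_t *icr_low, uint32_t sipi_cmd,
                       const volatile uint32_t *started,
                       uint32_t first_polls, uint32_t second_polls)
{
    *icr_low = sipi_cmd;                       /* first SIPI */
    if (flag_set_within(started, first_polls))
        return 1;                              /* skip the second SIPI */

    *icr_low = sipi_cmd;                       /* second SIPI */
    if (flag_set_within(started, second_polls))
        return 2;

    return 0;                                  /* report "failed to start" */
}
```

A return value of 0 is the case where the OS can tell the user that a CPU listed by the firmware did not come up.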

This alone doesn't completely solve the "CPU started twice" problem; and it doesn't solve the "AP started after the second SIPI, but only after the time-out expired, so now there are 2 APs running and the OS only knows about one" problem. These problems are fixed with extra synchronization - specifically, the AP sets the "I started" flag and then waits for the BSP to set a "you can continue if your APIC ID is ...." value (and if the AP detects that the APIC ID value is wrong it can do a "CLI then HLT" loop to shut itself down).
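The AP's side of that extra synchronization can be reduced to a small decision, sketched here. The struct layout, the NO_AP_RELEASED sentinel and the function name are assumptions made for the example; the actual "CLI then HLT" loop and the spin-wait would live in the AP startup code.

```c
#include <stdint.h>

#define NO_AP_RELEASED 0xFFFFFFFFu   /* assumed sentinel: BSP has not decided yet */

/* Shared between the BSP and the AP being started. */
struct ap_handshake {
    volatile uint32_t started;      /* set by the AP: "I started" */
    volatile uint32_t release_id;   /* set by the BSP: APIC ID allowed to continue */
};

enum ap_action {
    AP_WAIT,       /* BSP has not published an APIC ID yet: keep spinning */
    AP_CONTINUE,   /* published ID matches ours: carry on with startup */
    AP_HALT        /* published ID is someone else's: CLI then HLT loop */
};

/* Called by the AP (after setting hs->started) each time it checks
 * whether it may continue. */
enum ap_action ap_next_action(const struct ap_handshake *hs, uint32_t my_apic_id)
{
    uint32_t id = hs->release_id;   /* read once so the decision is consistent */
    if (id == NO_AP_RELEASED)
        return AP_WAIT;
    return (id == my_apic_id) ? AP_CONTINUE : AP_HALT;
}
```

The BSP would reset release_id to NO_AP_RELEASED and started to 0 before each start attempt, and only write the target's APIC ID once it has seen the flag within the time-out.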

Finally; if you do the whole "INIT-SIPI-SIPI" sequence one CPU at a time, then it can be slow if there's lots of CPUs (e.g. at least a whole second for 100 CPUs due to the 10 ms delay after sending INIT). This can be significantly reduced by using 2 different methods:

a) starting CPUs in parallel. For best case; BSP can start 1 AP, then BSP+AP can start 2 more APs, then BSP+3 APs can start 4 more APs, etc. This means 128 CPUs can be started in slightly more than 70 ms (instead of over a whole second). To make this work (to give each AP different values to use for stack, etc) it's best to use multiple AP CPU startup trampolines (e.g. so that an AP can do "mov esp,[cs:stackPointer]" where different APs are started with different values in cs because that came from the SIPI).

b) Sending multiple INITs to multiple CPUs one at a time; then having one 10 ms delay; then doing the later "SIPI-SIPI" sequence one CPU at a time. This relies on the later "SIPI-SIPI" sequence being relatively fast (compared to the huge 10 ms delay after INIT) and the CPU not being too fussy about the exact length of that 10 ms delay. For example; if you send 4 INITs to 4 CPUs and you know that (for a worst case) the SIPI-SIPI takes 1 ms for the OS to decide that the CPU failed to start; then there'd be a delay of 13 ms between sending the INIT to the fourth/last CPU and sending the first SIPI to the fourth/last CPU.
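The arithmetic behind both options can be written down as a quick check. This is only a model, assuming each "round" of option (a) costs one 10 ms INIT delay and that sending the INITs in option (b) takes negligible time; the function names are made up for this example.

```c
#include <stdint.h>

#define INIT_DELAY_MS 10u

/* Option (a): number of rounds needed if every running CPU starts one
 * more CPU per round (1 -> 2 -> 4 -> ...). */
uint32_t doubling_rounds(uint32_t total_cpus)
{
    uint64_t running = 1;   /* the BSP; 64-bit so the doubling cannot overflow */
    uint32_t rounds = 0;
    while (running < total_cpus) {
        running *= 2;
        rounds++;
    }
    return rounds;
}

/* Option (a): lower bound on total time, counting only the INIT delays. */
uint32_t parallel_start_min_ms(uint32_t total_cpus)
{
    return doubling_rounds(total_cpus) * INIT_DELAY_MS;
}

/* Option (b): worst-case gap between the INIT and the first SIPI for the
 * CPU at 1-based 'position' in the batch, if every CPU ahead of it takes
 * sipi_worst_case_ms before the OS gives up on it. */
uint32_t init_to_sipi_gap_ms(uint32_t position, uint32_t sipi_worst_case_ms)
{
    if (position == 0)
        return INIT_DELAY_MS;
    return INIT_DELAY_MS + (position - 1u) * sipi_worst_case_ms;
}
```

For 128 CPUs option (a) needs 7 rounds, which is where the "slightly more than 70 ms" comes from, and for the fourth CPU in a batch with a 1 ms worst case option (b) gives the 13 ms mentioned above.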

Note that if you're brave, both of these approaches can be combined (e.g. you could start 128 CPUs in a little more than 50 ms).