
Can different CPUs on an x86 machine have different local APIC register MMIO base addresses?


The Intel manual says that the local APIC registers are memory-mapped to a 4 KB region whose default address is FEE00000H. This address can be changed through the IA32_APIC_BASE MSR.

Quoting SDM Vol 3, section 10.4.5

The Pentium 4, Intel Xeon, and P6 family processors permit the starting address of the APIC registers to be relocated from FEE00000H to another physical address by modifying the value in the 24-bit base address field of the IA32_APIC_BASE MSR. This extension of the APIC architecture is provided to help resolve conflicts with memory maps of existing systems and to allow individual processors in an MP system to map their APIC registers to different locations in physical memory.
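To make the quoted base address field concrete, here is a minimal sketch of how the IA32_APIC_BASE value could be decoded and relocated. It only manipulates integers and does not touch real hardware. The bit positions (BSP flag at bit 8, global enable at bit 11, base field starting at bit 12) follow my reading of the SDM, and the helper names `decode_apic_base` and `relocate_apic_base` are made up for illustration.

```python
from typing import NamedTuple

IA32_APIC_BASE_MSR = 0x1B          # MSR index of IA32_APIC_BASE
BSP_FLAG = 1 << 8                  # set on the bootstrap processor
APIC_GLOBAL_ENABLE = 1 << 11       # APIC global enable bit
BASE_SHIFT = 12                    # the base is 4 KB aligned
BASE_FIELD_MASK = (1 << 24) - 1    # 24-bit field, as in the quoted text


class ApicBase(NamedTuple):
    base: int       # physical address of the APIC register page
    is_bsp: bool
    enabled: bool


def decode_apic_base(msr_value: int) -> ApicBase:
    """Split a raw IA32_APIC_BASE value into its fields."""
    field = (msr_value >> BASE_SHIFT) & BASE_FIELD_MASK
    return ApicBase(
        base=field << BASE_SHIFT,
        is_bsp=bool(msr_value & BSP_FLAG),
        enabled=bool(msr_value & APIC_GLOBAL_ENABLE),
    )


def relocate_apic_base(msr_value: int, new_base: int) -> int:
    """Return the MSR value that maps the APIC page at new_base.

    Hypothetical helper: all bits outside the base field are preserved.
    """
    if new_base & ((1 << BASE_SHIFT) - 1):
        raise ValueError("APIC base must be 4 KB aligned")
    if new_base >> (BASE_SHIFT + 24):
        raise ValueError("APIC base does not fit in the 24-bit field")
    other_bits = msr_value & ~(BASE_FIELD_MASK << BASE_SHIFT)
    return other_bits | new_base


if __name__ == "__main__":
    cpu1 = relocate_apic_base(0xFEE00800, 0xFEF00000)
    print(hex(cpu1), decode_apic_base(cpu1))
```

In the scenario asked about below, CPU 0 would keep a value whose base decodes to FEE00000H, while CPU 1 would write the relocated value to its own copy of the MSR.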

  1. Is it possible for different CPUs, on the same machine and at the same time, to have different base addresses for their local APICs? Say, CPU 0 stays at FEE00000H while CPU 1 moves to FEF00000H.

  2. If so, how can PCI MSI interrupts work? If different CPUs can have different local APIC addresses, then the same MSI message address would mean different things to different CPUs.


Solution

    1. Yes, as you quoted, this is explicitly allowed by Intel's manuals. However, since the P6 architecture, APIC accesses are handled internally by the processor, with no visible external bus cycle.

      For P6 family, Pentium 4, and Intel Xeon processors, the APIC handles all memory accesses to addresses within the 4-KByte APIC register space internally and no external bus cycles are produced.

      So the remapping is only there to avoid conflicts with legacy devices.
      Today this is rarely needed, since the region from 4 GiB - 18 MiB to 4 GiB - 17 MiB (open-ended) is reserved for MSIs.

    2. From the PCI point of view, MSIs are extremely simple: write a value to an address.
      Both the address and the value are configured in the PCI configuration space of the device, through the MSI or MSI-X capability structures (capability IDs 05h and 11h respectively).
      Only MSI-X allows for 32 bits of data.

      The PCI specs are intentionally generic: the address and data pair forms a unique "interrupt vector", i.e. a value that identifies an interrupt.
      The first devices actually had a very limited MESSAGE_DATA field, with only a few of the lower bits writable.
      In the x86 architecture, however, the address and data take a specific form.
      These forms are described in section 10.11 of Intel's manual 3A.

      Address format
      
      Bits    31-20   19-12            11-4       3    2    1-0
      Field   0FEEH   Destination ID   Reserved   RH   DM   XX
      
      Data format
      
      Bits    63-16      15   14   13-11      10-8            7-0
      Field   Reserved   TM   LM   Reserved   Delivery mode   Vector
      

      We can see that these formats are compatible with both MSI and MSI-X and, more importantly, that the address has a prefix of 0FEEh, making it at least 0FEE00000h, i.e. 4 GiB - 18 MiB.
      This area is used to route MSIs by the Host-to-PCI bridge, i.e. the Root Complex (either on the CPU chip or in the PCH), or by the MCH on older platforms.
      The address determines which set of processors will handle the MSI (the exact rules can be found in Intel's manual).

      Even if all local APICs are mapped at the same address, it is the destination ID in the MSI address that selects a set of processors: the x86 MSI design lets the OS direct interrupts to specific processors.
      So, no: the MSI address means the same thing for every CPU, because it is not the CPU that handles the PCI write but the Host-to-PCI bridge, and this chip is system-wide.
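The address and data formats above can be sketched as two small packing functions. This is only an illustration of the bit layout shown in the two tables, not driver code; the function and parameter names (`msi_address`, `msi_data`, `redirection_hint`, and so on) are my own.

```python
MSI_ADDRESS_PREFIX = 0xFEE << 20   # bits 31-20 of every x86 MSI address


def msi_address(dest_id: int, redirection_hint: int = 0, dest_mode: int = 0) -> int:
    """Pack an x86 MSI address: 0FEEh | Destination ID | RH | DM."""
    if not 0 <= dest_id <= 0xFF:
        raise ValueError("destination ID is an 8-bit field")
    return (
        MSI_ADDRESS_PREFIX
        | (dest_id << 12)                  # bits 19-12
        | ((redirection_hint & 1) << 3)    # bit 3 (RH)
        | ((dest_mode & 1) << 2)           # bit 2 (DM)
    )


def msi_data(vector: int, delivery_mode: int = 0,
             level: int = 0, trigger_mode: int = 0) -> int:
    """Pack the x86 MSI data word: TM | LM | Delivery mode | Vector."""
    if not 0 <= vector <= 0xFF:
        raise ValueError("vector is an 8-bit field")
    if not 0 <= delivery_mode <= 0x7:
        raise ValueError("delivery mode is a 3-bit field")
    return (
        ((trigger_mode & 1) << 15)         # bit 15 (TM)
        | ((level & 1) << 14)              # bit 14 (LM)
        | (delivery_mode << 8)             # bits 10-8
        | vector                           # bits 7-0
    )


if __name__ == "__main__":
    # Values an OS might program into a device's MSI capability
    # to target destination ID 1 with vector 0x41.
    print(hex(msi_address(dest_id=1)), hex(msi_data(vector=0x41)))
```

Note that nothing in these functions depends on where any CPU has mapped its local APIC registers: the pair is interpreted by the bridge, not by a core.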


    How can the Host-to-PCI bridge know where to send the MSI?

    The address has a destination ID (MDA in Intel terms, Message Destination Address) that, along with some meta information, is enough to route the message on the "APIC bus": a logical, implementation-defined structure (probably the ring bus with some QPI/UPI/DMI segments) that connects the APICs and the clusters of APICs.
    This is very similar to how network packets are routed.

    Don't the MSI range and the default local APIC range overlap?

    Yes, but these ranges live in two different "address spaces": the local APIC range lives inside each core, or at most in the uncore (but not in the System Agent), while the MSI range lives in the Host-to-PCI chip.
    Writes from the PCI bus targeting the MSI range never leave the Host-to-PCI bridge, while accesses to the local APIC range never leave the cores.
    The two domains communicate through the APIC bus.
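The separation between the two "address spaces" can be illustrated with a toy model. This is a deliberately simplified sketch, not a description of real hardware: the two functions, their return values, and the idea that a non-APIC CPU access simply goes "to the fabric" are all assumptions made for the example.

```python
APIC_PAGE_SIZE = 0x1000   # the local APIC register space is 4 KB
MSI_PREFIX = 0xFEE        # bits 31-20 of an x86 MSI address


def cpu_access_target(apic_base: int, addr: int) -> str:
    """Decide, from a core's point of view, where a load/store goes.

    The comparison uses that core's own APIC base, so two cores with
    different bases can decode the same address differently.
    """
    if apic_base <= addr < apic_base + APIC_PAGE_SIZE:
        return "local-apic"    # handled inside the core, no bus cycle
    return "fabric"            # leaves the core as an ordinary access


def device_write_target(addr: int):
    """Decide, from the Host-to-PCI bridge's point of view, what a
    device-initiated write is. No per-CPU APIC base is consulted here."""
    if (addr >> 20) & 0xFFF == MSI_PREFIX:
        dest_id = (addr >> 12) & 0xFF
        return ("msi", dest_id)    # turned into an interrupt message
    return ("dma", addr)           # ordinary memory write


if __name__ == "__main__":
    cpu0_base, cpu1_base = 0xFEE00000, 0xFEF00000
    # The same address decodes differently on the two cores...
    print(cpu_access_target(cpu0_base, 0xFEE00020))
    print(cpu_access_target(cpu1_base, 0xFEE00020))
    # ...but a device write aimed at destination ID 1 is decoded the
    # same way regardless of either base.
    print(device_write_target(0xFEE01000))
```

The point of the model is that `device_write_target` takes no APIC base argument at all, which is why relocating a core's APIC registers does not change the meaning of an MSI address.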


    Addendum

    A question arose in the comments: in a system with VT-d enabled (essentially an IOMMU), why is there a need for Interrupt Remapping (IR) machinery? Why can't DMA remapping suffice?

    After all, MSI(-X) are memory writes initiated by a device, so DMA remapping might seem to be enough to remap interrupts.

    However, DMA remapping only translates addresses, while an MSI(-X) is made of a target address (which conveys the destination of the interrupt) and a data value (which specifies the type of interrupt to deliver).
    To allow software to fully control the interrupt to be delivered, the VT-d specification introduced a new format for MSI(-X): the remappable format.
    The idea is similar to the standard memory translation used by the CPU: the MSI(-X) carries just enough information to index a table that holds the full definition of the interrupt's target and type.

    The new format is:

    [Figure: Remappable MSI(-X) format]

    There are essentially three fields:

    • The Handle and Sub handle fields.
      They are added together to get the index into the IR Table (IRT). They are separate because the Sub handle is in the Data register while the Handle is in the Address register; this allows a device to be configured with a single address and multiple data values (apparently this turned out to be a necessary requirement).
    • The SHV field, which tells whether the Sub handle field is valid; if it is not, zero is used instead.

    Note that the index is a 16-bit value.

    Once the index has been found, it is used to retrieve the IRT entry (IRTE), which contains all the information needed to deliver the interrupt, including some fields to validate the source of the interrupt.

    Note that legacy (Compatibility format) MSI(-X) are never remapped: they are either passed through or blocked, depending on how the software configures the translator.
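The index computation described above can be sketched as follows. The text does not give the bit positions, so the ones used here are an assumption based on my reading of the VT-d specification: Handle[14:0] in address bits 19-5, Handle[15] in address bit 2, SHV in address bit 3, a format bit in address bit 4 (1 = remappable), and the Sub handle in data bits 15-0. The function name `irt_index` is hypothetical.

```python
from typing import Optional

MSI_PREFIX = 0xFEE           # bits 31-20 of the address
FORMAT_REMAPPABLE = 1 << 4   # assumed: address bit 4 selects the format
SHV = 1 << 3                 # assumed: address bit 3, Sub handle valid


def irt_index(address: int, data: int) -> Optional[int]:
    """Return the IR Table index encoded in an MSI(-X) write.

    Returns None for a Compatibility-format message, which is not
    remapped. Bounds checking against the table size is not modeled.
    """
    if (address >> 20) & 0xFFF != MSI_PREFIX:
        raise ValueError("not an interrupt message address")
    if not address & FORMAT_REMAPPABLE:
        return None

    # Handle[14:0] comes from address bits 19-5, Handle[15] from bit 2.
    handle = ((address >> 5) & 0x7FFF) | (((address >> 2) & 1) << 15)

    # The Sub handle lives in the data word and only counts when SHV
    # is set; otherwise zero is used.
    subhandle = (data & 0xFFFF) if address & SHV else 0

    return handle + subhandle


if __name__ == "__main__":
    # One address (handle 5, SHV set) and several data values select
    # consecutive table entries.
    addr = (MSI_PREFIX << 20) | (5 << 5) | FORMAT_REMAPPABLE | SHV
    print([irt_index(addr, d) for d in range(4)])
```

This also shows why splitting the index is convenient: a device that can only vary its data value can still reach several table entries from a single programmed address.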