
MSR vs MMIO/PCIe configuration space


Intel sometimes uses MSRs and sometimes "internal" PCIe devices to expose configuration options to the OS. I could not find any resources that describe the advantages of/reasons for using PCIe devices over MSRs. Since the MSR address space has 2^32 entries, there should be enough room to host the configuration space.

For example, a 3rd Gen Xeon 6346 has over 100 internal PCIe devices which expose different configuration options.


Solution

  • MSR space has the disadvantage that it can never be accessed directly from user mode (i.e., the RDMSR and WRMSR instructions can only be executed in kernel mode). The /dev/cpu/*/msr interface in Linux only allows one MSR to be read for each kernel call, and in most cases requires an interprocessor interrupt so that the kernel will execute the RDMSR/WRMSR instruction on the targeted logical processor. This leads to average overheads of thousands of cycles and thousands of instructions for each of the MSRs read. Reading all of the core and uncore counters in a 2-socket server can require well over 1000 kernel calls -- millions of instructions and milliseconds of wall clock time. (I sometimes use the "msr_batch" interface from https://github.com/LLNL/msr-safe to reduce this overhead, but I only have it installed on a few test systems.)
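
    For reference, here is a minimal sketch of reading one MSR through the Linux msr driver. It assumes the msr module is loaded and the caller has the necessary privileges; IA32_TIME_STAMP_COUNTER (0x10) is used only as an illustrative MSR number. Each pread() is one kernel call (plus, typically, an IPI to the target core), which is where the per-MSR overhead described above comes from.

        /* Sketch: one MSR read per kernel call via /dev/cpu/N/msr.
         * Assumes the msr module is loaded and sufficient privileges. */
        #include <stdio.h>
        #include <stdint.h>
        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("/dev/cpu/0/msr", O_RDONLY);
            if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

            uint64_t value;
            uint32_t msr = 0x10;                 /* IA32_TIME_STAMP_COUNTER (example only) */
            if (pread(fd, &value, sizeof(value), msr) != sizeof(value)) {
                perror("pread");
                close(fd);
                return 1;
            }
            printf("MSR 0x%x on CPU 0 = 0x%016llx\n", msr, (unsigned long long)value);
            close(fd);
            return 0;
        }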

    PCI configuration space can be accessed directly from user mode (with privileged access) by performing an mmap() on /dev/mem. (Not recommended for beginners!) This reduces overhead to the ~200-400 cycles required for each 32-bit uncached load (higher for cross-socket access). On the minus side, PCI configuration space is composed of 4KiB blocks, so it is not possible to use the virtual memory system to provide fine-grained access control within a 4KiB block.
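
    As an illustration (not a recommendation), the sketch below maps one device's 4 KiB configuration block from the ECAM/MMCONFIG region via /dev/mem. The ECAM base address used here is only a placeholder -- the real value comes from the ACPI MCFG table or /proc/iomem -- and the kernel must permit /dev/mem access to that range.

        /* Sketch: mapping a 4 KiB PCI configuration block from ECAM via /dev/mem.
         * ECAM_BASE is a platform-specific placeholder (see the ACPI MCFG table). */
        #include <stdio.h>
        #include <stdint.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/mman.h>

        #define ECAM_BASE 0x80000000ULL   /* placeholder: real MMCONFIG base varies by platform */

        static volatile uint32_t *map_cfg(int fd, unsigned bus, unsigned dev, unsigned fn)
        {
            /* ECAM layout: one 4 KiB block per bus/device/function */
            off_t offset = ECAM_BASE + ((off_t)bus << 20) + ((off_t)dev << 15) + ((off_t)fn << 12);
            void *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, offset);
            return (p == MAP_FAILED) ? NULL : (volatile uint32_t *)p;
        }

        int main(void)
        {
            int fd = open("/dev/mem", O_RDONLY);
            if (fd < 0) { perror("open /dev/mem"); return 1; }

            volatile uint32_t *cfg = map_cfg(fd, 0, 0, 0);   /* bus 0, device 0, function 0 */
            if (!cfg) { perror("mmap"); close(fd); return 1; }

            /* Each access is an uncached 32-bit load (roughly hundreds of cycles). */
            printf("vendor/device ID: 0x%08x\n", (unsigned)cfg[0]);

            munmap((void *)cfg, 4096);
            close(fd);
            return 0;
        }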

    Starting with Ice Lake Xeon, Intel has moved some of the uncore counters from PCI configuration space to memory-mapped PCI BARs. This provides more flexibility for organization and access control, and as a nice side effect also allows each counter to be read with a single 64-bit read instead of two 32-bit reads. A disadvantage is that the configuration is much more difficult to understand. As with PCI configuration space, a misaligned load or store can crash your system, so this requires very careful coding and testing.
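
    A hedged sketch of the BAR-based access follows: it maps one page of a counter BAR through /dev/mem and performs a single aligned 64-bit load. Both the BAR physical address and the counter offset below are hypothetical placeholders; the real values come from the device's configuration space and from the uncore performance monitoring documentation for the specific processor.

        /* Sketch: reading a 64-bit uncore counter from a memory-mapped PCI BAR.
         * bar_phys and ctr_off are placeholders, not real register addresses. */
        #include <stdio.h>
        #include <stdint.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/mman.h>

        int main(void)
        {
            const off_t  bar_phys = 0xfe000000;  /* placeholder: BAR base from config space */
            const size_t ctr_off  = 0x8;         /* placeholder: counter offset within the BAR */

            int fd = open("/dev/mem", O_RDONLY);
            if (fd < 0) { perror("open /dev/mem"); return 1; }

            volatile uint8_t *bar = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, bar_phys);
            if (bar == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

            /* One aligned 64-bit load returns the whole counter value. */
            uint64_t count = *(volatile uint64_t *)(bar + ctr_off);
            printf("counter = 0x%016llx\n", (unsigned long long)count);

            munmap((void *)bar, 4096);
            close(fd);
            return 0;
        }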

    At the hardware level, non-local MSR access (note 1), PCI configuration space access, and PCIe BAR access are almost certainly all converted to transactions on the (undocumented) low-level messaging system -- so it may not make much difference to the engineers designing the performance monitoring units. (But it is certainly a pain for those of us trying to access the counters -- currently requiring 4 different mechanisms on Intel processors. There are also a few PCIe configuration space devices that give core-specific results. This is counterintuitive, to say the least.)

    Note 1: "Local" MSR access might not require leaving the core+private cache block. The user-mode RDPMC instruction reads the same (core-local) performance counters that the RDMSR instruction can read, but does it at very low latency -- I recall numbers in the 25 core cycle range for RDPMC on Skylake Xeon. I have not tested to see if the RDMSR instruction (in kernel mode, operating on the PMC MSR addresses) executes at the same cost.
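
    For completeness, here is a minimal RDPMC sketch (GCC/Clang inline asm). It assumes user-mode RDPMC is enabled (CR4.PCE set; on Linux this is controlled via /sys/devices/cpu/rdpmc or by mmap()ing a perf_event) and that the selected counter has already been programmed -- otherwise the instruction faults or reads an idle counter.

        /* Sketch: low-latency user-mode counter read via RDPMC.
         * Index 0 selects general-purpose counter 0; fixed counters use
         * indices with bit 30 set (e.g., (1u << 30) for fixed counter 0). */
        #include <stdint.h>
        #include <stdio.h>

        static inline uint64_t rdpmc(uint32_t counter)
        {
            uint32_t lo, hi;
            __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
            return ((uint64_t)hi << 32) | lo;
        }

        int main(void)
        {
            uint64_t before = rdpmc(0);
            /* ... code to be measured would go here ... */
            uint64_t after  = rdpmc(0);
            printf("counter 0 delta = %llu\n", (unsigned long long)(after - before));
            return 0;
        }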