Tags: x86, cpu-architecture, tlb, mmu, hyperthreading

Does a hyper-threaded core share MMU and TLB?


To my knowledge, neither the MMU nor the TLB is shared within a hyper-threaded core on Intel x86_64.

However, if two threads that don't share an address space are scheduled onto the same physical core, how do they run?

I think that, in that case, the threads never hit in the TLB, because each thread has its own address space.

If so, performance would be severely degraded, in my opinion.


Solution

  • The TLBs are organized as follows in Intel and AMD processors (a CPUID-based sketch for inspecting the TLBs of your own CPU appears further below):

    • Intel NetBurst (the first to support HT): The ITLB is replicated. The DTLB is competitively shared.
    • Intel Nehalem (the second to support HT), Westmere, Sandy Bridge, and Ivy Bridge: The huge page ITLB is replicated. The small page ITLB is statically partitioned. All DTLBs are competitively shared.
    • Intel Haswell, Broadwell, and Skylake: The small page ITLB is dynamically partitioned. The huge page ITLB is replicated. Table 2-12 of the optimization manual (September 2019) says that the policy is "fixed" for the other TLBs. I initially thought this meant static partitioning. But according to the experimental results of the paper titled Translation Leak-aside Buffer: Defeating Cache Side-channel Protections with TLB Attacks (Section 6), it appears that "fixed" means competitive sharing. That would be consistent with earlier and later microarchitectures.
    • Sunny Cove: The ITLBs are statically partitioned. All DTLBs and the STLB are competitively shared.
    • AMD Zen, Zen+, Zen 2 (Family 17h): All TLBs are competitively shared.


    It's not clear to me how the TLBs are organized in any of the Intel Atom microarchitectures. I think that the L1 DTLB and STLB (in Goldmont Plus) or L2 DTLB (in earlier microarchitectures) are shared. According to Section 8.7.13.2 of the Intel SDM V3 (October 2019):

    In processors supporting Intel Hyper-Threading Technology, data cache TLBs are shared. The instruction cache TLB may be duplicated or shared in each logical processor, depending on implementation specifics of different processor families.

    However, this is not entirely accurate, since an ITLB can be partitioned as well (rather than duplicated or shared).

    I don't know about the ITLBs in Intel Atoms.

    (By the way, in older AMD processors, all the TLBs are replicated per core. See: Physical core and Logical cores on different cpu AMD/Intel.)
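    On fairly recent Intel CPUs you can inspect this yourself: CPUID leaf 0x18 (Deterministic Address Translation Parameters) enumerates each TLB, its geometry, and an upper bound on how many logical processors share it. The following is a minimal sketch, assuming GCC or Clang on x86-64 and a CPU that implements this leaf; older CPUs report TLB information through CPUID leaf 0x02 instead.

```c
/* Sketch: enumerate the TLBs of the running CPU with CPUID leaf 0x18.
 * Field layout per the Intel SDM, Vol. 2, CPUID instruction. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Sub-leaf 0 reports the maximum sub-leaf number in EAX. */
    if (!__get_cpuid_count(0x18, 0, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 0x18 not supported on this CPU");
        return 1;
    }
    unsigned max_subleaf = eax;

    for (unsigned sub = 0; sub <= max_subleaf; sub++) {
        __get_cpuid_count(0x18, sub, &eax, &ebx, &ecx, &edx);

        unsigned type = edx & 0x1f;          /* 1 = DTLB, 2 = ITLB, 3 = unified */
        if (type == 0)
            continue;                        /* invalid sub-leaf */

        unsigned level   = (edx >> 5) & 0x7;
        unsigned ways    = ebx >> 16;
        unsigned sets    = ecx;
        /* Max addressable IDs of logical processors sharing this structure. */
        unsigned sharing = ((edx >> 14) & 0xfff) + 1;

        printf("L%u %s: %u ways x %u sets; page sizes 4K=%u 2M=%u 4M=%u 1G=%u; "
               "shared by up to %u logical processors\n",
               level,
               type == 1 ? "DTLB" : type == 2 ? "ITLB" :
               type == 3 ? "unified TLB" : "other TLB",
               ways, sets,
               ebx & 1, (ebx >> 1) & 1, (ebx >> 2) & 1, (ebx >> 3) & 1,
               sharing);
    }
    return 0;
}
```

    In principle, a structure private to one logical processor should report a sharing count of 1 and a shared or partitioned one should report 2, but the field is a rounded-up upper bound, so treat it as a hint rather than a definitive statement of the sharing policy.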

    When a TLB is shared, each entry is tagged with the ID of the logical processor that allocated it (a single-bit value; this is different from the process-context identifier, which can be disabled or may not be supported). When a thread that uses a different virtual address space than the previous thread gets scheduled on a logical core, the OS has to load the physical base address of the new thread's top-level paging structure into CR3. Whenever CR3 is written, the core automatically flushes, in all shared TLBs, every entry tagged with the ID of that logical core. There are other operations that may also trigger this flushing. A toy model of this mechanism is sketched below.
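    To make the tagging concrete, here is a deliberately simplified model of a competitively shared TLB, written in C. This is not how the hardware is built (real TLBs are set-associative and the tags are not software-visible); it only illustrates that each entry carries the ID of the logical processor that allocated it, that lookups ignore the sibling's entries, and that a CR3 write on one logical processor flushes only that processor's entries (PCIDs are ignored here).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model of one competitively shared, fully associative TLB.
 * Each entry is tagged with the logical processor (0 or 1) that filled it. */
#define TLB_ENTRIES 8

struct tlb_entry {
    bool     valid;
    unsigned lp;    /* logical processor ID that allocated the entry */
    uint64_t vpn;   /* virtual page number */
    uint64_t pfn;   /* physical frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* A lookup only hits on entries tagged with the requesting logical processor. */
static bool tlb_lookup(unsigned lp, uint64_t vpn, uint64_t *pfn)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].lp == lp && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;
        }
    return false;
}

/* Fill after a page walk; evicts the first invalid entry (or entry 0). */
static void tlb_fill(unsigned lp, uint64_t vpn, uint64_t pfn)
{
    int victim = 0;
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (!tlb[i].valid) { victim = i; break; }
    tlb[victim] = (struct tlb_entry){ true, lp, vpn, pfn };
}

/* Writing CR3 (without PCIDs) flushes only the writer's entries;
 * the sibling logical processor keeps its translations. */
static void mov_to_cr3(unsigned lp)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].lp == lp)
            tlb[i].valid = false;
}

int main(void)
{
    uint64_t pfn;
    tlb_fill(0, 0x1000, 0xaaaa);   /* thread on LP 0 caches a translation */
    tlb_fill(1, 0x1000, 0xbbbb);   /* sibling caches the same VPN -> other frame */

    mov_to_cr3(0);                 /* context switch on LP 0 only */

    printf("LP0 hit: %d\n", tlb_lookup(0, 0x1000, &pfn));  /* 0: flushed */
    printf("LP1 hit: %d\n", tlb_lookup(1, 0x1000, &pfn));  /* 1: still valid */
    return 0;
}
```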

    Partitioned and replicated TLBs don't need to be tagged with logical core IDs.

    If process-context identifiers (PCIDs) are supported and enabled, logical core IDs are not used because PCIDs are more powerful. Note that partitioned and replicated TLBs are tagged with PCIDs.
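    For reference, when CR4.PCIDE = 1 the PCID occupies bits 11:0 of CR3, and setting bit 63 of the value written to CR3 tells the processor not to invalidate the TLB entries tagged with the new PCID, so translations can survive an address-space switch. Below is a minimal sketch of how such a CR3 value is composed (field positions per the Intel SDM; the values are hypothetical, and the actual MOV to CR3 is a privileged instruction, mentioned only in a comment).

```c
#include <stdint.h>
#include <stdio.h>

/* Layout of a CR3 value when CR4.PCIDE = 1 (per the Intel SDM):
 *   bits 11:0    PCID of the address space being switched to
 *   bits M-1:12  physical address of the top-level paging structure (4 KB aligned)
 *   bit 63       if set, do NOT invalidate TLB entries tagged with the new PCID
 */
#define CR3_NOFLUSH (1ull << 63)

static uint64_t make_cr3(uint64_t top_level_phys, uint16_t pcid, int keep_tlb)
{
    uint64_t cr3 = (top_level_phys & ~0xfffull) | (pcid & 0xfff);
    if (keep_tlb)
        cr3 |= CR3_NOFLUSH;   /* entries for this PCID survive the switch */
    return cr3;
}

int main(void)
{
    /* Hypothetical physical address and PCID, just to show the encoding. */
    uint64_t cr3 = make_cr3(0x123456000ull, 0x005, 1);
    printf("CR3 = %#llx\n", (unsigned long long)cr3);
    /* A kernel would then execute the privileged instruction: mov %rax, %cr3 */
    return 0;
}
```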

    Related: Address translation with multiple pagesize-specific TLBs.

    (Note that there are other paging structure caches and they are organized similarly.)

    (Note that usually the TLB is considered to be part of the MMU. The Wikipedia article on the MMU shows a figure taken from an old edition of a book that depicts them as separate. However, the most recent edition of that book has removed the figure and says that the TLB is part of the MMU.)