The Intel optimization manual (September 2019 revision) lists a 48 KiB, 8-way set-associative L1 data cache for the Ice Lake microarchitecture.
> ¹ Software-visible latency/bandwidth will vary depending on access patterns and other factors.
This baffled me because:

1. With 64-byte lines, 48 KiB / 8 ways / 64 B = 96 sets, which is not a power of two, so the set index is no longer a simple bit field of the address.
2. Indexing 96 sets takes 7 bits; together with the 6 line-offset bits that is 13 bits, exceeding the 12-bit page offset, so the cache could no longer be indexed entirely from page-offset bits (the usual VIPT property).
All in all, it seems the cache became more expensive to handle, yet the latency increased only slightly (if it increased at all, depending on what exactly Intel means by that number).
With a bit of creativity I can still imagine a fast way to index 96 sets, but point two looks like an important breaking change to me.
What am I missing?
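For concreteness, here is the arithmetic behind the two points as a minimal sketch (the 64-byte line size and 4 KiB pages are my assumptions, taken from previous Intel microarchitectures):

```python
import math

LINE_SIZE = 64          # bytes per cache line (assumed)
PAGE_OFFSET_BITS = 12   # 4 KiB pages (assumed)

def sets(size_bytes, ways, line=LINE_SIZE):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line)

# Geometry as printed in the manual: 48 KiB, 8-way.
s = sets(48 * 1024, 8)
print(s)                          # 96 sets
print((s & (s - 1)) == 0)         # False: 96 is not a power of two
index_bits = math.ceil(math.log2(s))
offset_bits = int(math.log2(LINE_SIZE))
print(index_bits + offset_bits)   # 13 bits needed, > 12-bit page offset
```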
The optimization manual is wrong.
According to the CPUID instruction, the associativity is 12 (measured on a Core i5-1035G1). See also uops.info/cache.html and en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(client).
This means that there are 48 KiB / (12 ways × 64 B) = 64 sets, the same as in previous microarchitectures: 6 index bits plus 6 offset bits still fit within the 12-bit page offset, so the VIPT indexing property is preserved.
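CPUID leaf 4 (deterministic cache parameters) encodes ways−1 in EBX[31:22], partitions−1 in EBX[21:12], line size−1 in EBX[11:0], and sets−1 in ECX. A small decoder, fed with register values consistent with the 12-way geometry (the raw values below are constructed for illustration, not read from a real chip):

```python
def decode_cpuid_leaf4(ebx, ecx):
    """Decode cache geometry from CPUID leaf 4 register values."""
    ways = ((ebx >> 22) & 0x3FF) + 1
    partitions = ((ebx >> 12) & 0x3FF) + 1
    line_size = (ebx & 0xFFF) + 1
    sets = ecx + 1
    size = ways * partitions * line_size * sets
    return ways, sets, line_size, size

# Register values matching a 12-way, 64-set, 64 B/line L1d
# (illustrative constants; a real query executes the CPUID instruction).
ebx = (11 << 22) | (0 << 12) | 63   # ways-1=11, partitions-1=0, line-1=63
ecx = 63                            # sets-1
ways, sets, line, size = decode_cpuid_leaf4(ebx, ecx)
print(ways, sets, line, size)       # 12 64 64 49152 (= 48 KiB)
```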