Where can I find data about "market share" of x86 microarchitectures? What percentage of users of x86-family CPUs have a CPU that supports SSE4.2, AVX, AVX2, etc.?
I'm distributing precompiled binaries for my program, and I would like to know what is the best optimization target, and which SIMD extensions can be reasonably used without runtime checks.
I can find overall Intel vs AMD market share data, but not a breakdown by generation of Intel's and AMD's CPUs. Ideally I'd also like a breakdown per OS and per country, but even general global stats for microarchitectures would be better than nothing.
Anything newer than SSE2 (baseline for x86-64) without runtime checks is risky if there's no fallback or install-time detection.
AVX and BMI1/BMI2 are sadly very far from being baseline, because Intel is still selling Celeron/Pentium chips with VEX prefix decoding disabled (presumably so it can sell silicon with defects in the 256-bit execution units). SSE4.2 is getting closer, though, and SSSE3 is a possibility. See "Most recent processor without support of SSSE3 instructions?" and "Mac OSX minumum support sse version".
"Do all 64 bit intel architectures support SSSE3/SSE4.1/SSE4.2 instructions?" has a link to the Valve Hardware Survey for Steam clients (currently showing SSE3 at ~100% of the installed base, but SSSE3 at only 97%), so if you're shipping a PC game, that should correlate pretty well with your target audience. Some of the breakdowns are a bit weird, though. For example, fcmov (x87 branchless conditional move) is reported as having gone down to 97.5%, but every P6-compatible CPU has it. You won't find a CPU with SSE2 but without FCMOV. Perhaps newer versions of Steam aren't testing for it, and perhaps older versions of Steam aren't testing for CMPXCHG16B? So take the numbers with a grain of salt, but they're probably fairly sensible for SSE2/SSE3/SSSE3/SSE4.x and AVX.
For server stuff, you might easily be able to set an SSE4.2 minimum. Atom/Silvermont support it, and so do AMD's and VIA's low-power architectures, so energy-efficient servers can run it. Ancient mainstream CPUs don't tend to get much use for servers outside of personal home-server use, because they're often slower than a cheaper modern machine that runs cooler.
(Silvermont isn't likely to support AVX soon, even less AVX2 or FMA.)
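If you do ship a single binary built above the SSE2 baseline, a startup check at least turns an illegal-instruction crash into a readable error. Here is a minimal sketch, assuming GCC or clang on x86 and assuming (purely for illustration) that the build requires SSSE3, SSE4.2 and POPCNT. The function name first_missing_feature is made up for this example.

```c
#include <stddef.h>

/* Returns the name of the first required extension the running CPU lacks,
 * or NULL if everything this build assumes is present.
 * The required set below is an assumption for illustration; adjust it to
 * whatever -m / -march options the rest of the program is built with.
 * __builtin_cpu_supports() needs a string literal, hence the separate ifs. */
const char *first_missing_feature(void)
{
    __builtin_cpu_init();               /* harmless if already initialised */

    if (!__builtin_cpu_supports("ssse3"))
        return "ssse3";
    if (!__builtin_cpu_supports("sse4.2"))
        return "sse4.2";
    if (!__builtin_cpu_supports("popcnt"))
        return "popcnt";
    return NULL;
}
```

You would call this first thing in main, print the returned name and exit if it's non-NULL. Compile the file containing the check (and main) for the plain baseline, otherwise the compiler is free to use the newer instructions before the check has run.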
You don't have to limit yourself to a single binary. You could even let people pick when they download, or your installer could select at install time.
Or you could have a run-time wrapper that picks an executable and dynamic libraries, so you effectively get runtime dispatching while still being able to compile with gcc -O3 -march=haswell or whatever, letting the compiler use new instruction sets all over the place (especially beneficial with BMI1/BMI2, e.g. BMI2's efficient single-uop variable-count shifts).
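A rough sketch of such a wrapper, assuming GCC or clang on a POSIX system. The file names myprog.haswell, myprog.nehalem and myprog.baseline are hypothetical (the same program built with different -march options), and the features checked are only a subset of what -march=haswell actually enables, so treat the selection logic as a starting point rather than a complete test.

```c
#include <stdio.h>
#include <unistd.h>

/* Pick the most specialised build the running CPU can execute.
 * File names are hypothetical; the launcher itself must be compiled
 * for the baseline so it runs everywhere. */
const char *pick_variant(void)
{
    __builtin_cpu_init();

    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("bmi2")
        && __builtin_cpu_supports("fma"))
        return "myprog.haswell";
    if (__builtin_cpu_supports("sse4.2") && __builtin_cpu_supports("popcnt"))
        return "myprog.nehalem";
    return "myprog.baseline";
}

/* Replace the current process with the chosen build found in dir.
 * Only returns (with -1) if the path is too long or execv fails. */
int launch_variant(const char *dir, char *const argv[])
{
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", dir, pick_variant());

    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    execv(path, argv);
    perror(path);                       /* execv returned, so it failed */
    return -1;
}
```

The wrapper's main would just forward its own argv to launch_variant. Since execv replaces the process image, the real program pays nothing for the dispatch after startup.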
Another option is dynamic linker tricks, either on a whole-library basis or on a per-function basis, like glibc uses to resolve memset to __memset_avx2_unaligned_erms. See "perf report shows this function "__memset_avx2_unaligned_erms" has overhead. does this mean memory is unaligned?"
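For your own functions, one way to get per-function dispatch without writing a resolver by hand is GCC's target_clones attribute, which per the GCC docs generates one clone per listed target plus a resolver using the same ifunc mechanism. A minimal sketch, assuming GCC (or a recent clang) on a GNU/Linux-style toolchain with ifunc support; the function sum_i32 is just a stand-in for a real hot loop.

```c
#include <stddef.h>

/* The compiler emits an AVX2 clone and a default (baseline) clone of this
 * function, and the dynamic linker binds the symbol to one of them at load
 * time based on the running CPU. sum_i32 is a hypothetical example. */
__attribute__((target_clones("avx2", "default")))
long sum_i32(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];                  /* auto-vectorised per clone at -O3 */
    return total;
}
```

Callers just call sum_i32 normally. The catch is that a function dispatched this way is reached through an indirect call and can't inline into its callers, so it only pays off for functions that do a decent amount of work per call.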
All of these (except the per-function dynamic linker tricks) are easier than making your code aware of instruction-set extensions at runtime, and have zero performance overhead. (Unless you put stuff in a dynamic library when you wouldn't have otherwise, so it can't inline.)