c apple-m1 arm64 neon microbenchmark

NEON cycle counts for the M2?


Is there a resource that lists how many cycles each SIMD (NEON) instruction takes on Apple M1/M2, like https://uops.info/table.html or Agner Fog's instruction tables do for x86? I wish I could give a bigger bounty, but that's all the rep I have.

I have never programmed on an ARM machine. I took a look at sse2neon:
https://github.com/DLTcollab/sse2neon/blob/7bd15eac51e36bf7426052f8515358cb665d8c04/sse2neon.h

The first thing I looked up was setzero. I doubted that dup was the best way to produce a zero vector, so I benchmarked it with nanobench and saw that xor was faster, and that sub did not give the same timing either.

Is there something I can look up to get a rough idea? My target is the M2.

#include <arm_neon.h>
#define ANKERL_NANOBENCH_IMPLEMENT
#include "nanobench.h"

int32x4_t setzeroA()
{
    return vdupq_n_s32(0);
}
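int32x4_t setzeroB()
{
    int32x4_t v; // deliberately uninitialized: v - v is zero for any input (reading it is formally undefined behavior)
    return vsubq_s32(v, v); // signed variant, to match int32x4_t
}
uint8x16_t setzeroC()
{
    uint8x16_t v; // deliberately uninitialized, as above
    return veorq_u8(v, v); // v ^ v is zero for any input
}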

int main() {
    ankerl::nanobench::Bench().run("Set", [&] {
        auto v = setzeroA();
        ankerl::nanobench::doNotOptimizeAway(v);
    });
    ankerl::nanobench::Bench().run("sub", [&] {
        auto v = setzeroB();
        ankerl::nanobench::doNotOptimizeAway(v);
    });
    ankerl::nanobench::Bench().run("xor", [&] {
        auto v = setzeroC();
        ankerl::nanobench::doNotOptimizeAway(v);
    });
}

Solution

  • These tables are for the M1 (Firestorm and Icestorm cores), but I doubt anything major has changed with the M2.

    Big: https://dougallj.github.io/applecpu/firestorm-simd.html

    Little: https://dougallj.github.io/applecpu/icestorm-simd.html
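
To get the "rough idea" asked for, tables like these are usually read as two numbers per instruction: a latency and a throughput. Below is a minimal sketch of turning those two numbers into a lower-bound cycle estimate. The struct, the function names, and any values you plug in are hypothetical and chosen for illustration. They are not taken from the linked tables, and the model ignores pipeline fill, front-end limits, and memory effects.

```cpp
#include <cstddef>

// Hypothetical per-instruction figures, as you might read them off a
// latency/throughput table. Any values passed in are placeholders,
// NOT measured M1/M2 numbers.
struct InstrTiming {
    double latency_cycles;    // cycles until the result can feed a dependent instruction
    double issued_per_cycle;  // how many independent copies can start each cycle
};

// Rough lower bound for a chain in which every instruction consumes
// the result of the previous one: the latencies simply add up.
inline double chain_cycles(std::size_t count, InstrTiming t) {
    return static_cast<double>(count) * t.latency_cycles;
}

// Rough lower bound for fully independent instructions: limited only
// by how many can be issued per cycle.
inline double independent_cycles(std::size_t count, InstrTiming t) {
    return static_cast<double>(count) / t.issued_per_cycle;
}
```

For example, with placeholder values of 2 cycles latency and 4 instructions per cycle, `chain_cycles(1000, t)` gives 2000 while `independent_cycles(1000, t)` gives 250. This is why a benchmark that only times one isolated call tells you little about either number.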