Since amd zen 4 has only 256bit wide operations on vector data, the following diagram from chipsandcheese's Zen 4 article shows 6 FP pipelines (4 ALU and 2 memory):
Each FMA does 1 multiplication and 1 add while fadd does only 1 add. So does this mean theoretically it can do a total of 2 multiplications and 4 adds = 6 operations of 256 bits each?
Assuming all 4adds and 2 muls can be issued in same cycle, can this mean 256bits (or just 8 floats of 32bit precision) x 6 = 48 elements are computed per cycle (or 48 gflops/s per GHz)?
Assuming all operands are in registers, there should be enough bandwidth to get the data to fpu (the L1 bandwidth says 2x256 bits per cycle for reading is only enough for 8 flops per cycle but registers must be much faster), but the fpu throughput isn't clearly shown.
How does this compare to Intel 11/12/13 gen? For example, some workstation xeons had 2x fpu of 512bits each but no dedicated "add"s? Is it fair to compare cpus with different ratios of muls and adds for flops-to-flops? Looks like amd is better on:
d += a * b + c;
// or
d += a * b;
e += c;
while intel is better on:
d = a * b + c;
// or
d+=a*b;
per gflops. Intel's flops value looks better for matrix multiplication and blending. AMD's flops value looks better for chained matrix add & multiplication and some loop with float accumulator & matrix multiplication.
So when doing matrix multiplication, is zen 4 effectively 32 flops per cycle?
Yes, 48 FLOP / cycle theoretical max throughput on Zen 4 if you have a use for adds and FMAs in the same loop.
I'd guess that usually this is most useful when you have many short-vector dot products that aren't matmuls, so each cleanup loop needs to do some shuffling and adding. Out-of-order exec can overlap that work with FMAs.
And in code not using FMAs, you still have 2 mul + 2 add per clock, which is potentially quite useful for less well optimized code. (A lot of real-life code is not well optimized. How many times have you seen people give advice to not worry about performance?)
Also with a mix of shuffles and other non-FP-math vector work, that can run on a good mix of ports and still leave some room for FP adds and multiplies.
AFAIK, Zen 4 can keep both FMA and both FP-ADD units busy at the same time, so yes, 2 vector FMAs and 2 vector vaddps
every cycle. So that's 6x vector-width FLOPs. It doesn't make sense to call it "4adds and 2 muls" being issued (and dispatched to execution units) in the same cycle, though, since the CPU sees them as 2 FMA and 2 ADD operations, not 6 separate uops.
So when doing matrix multiplication, is zen 4 effectively 32 flops per cycle?
Yes, standard matmal is all FMAs, little to no use for extra FP-add throughput.
Maybe some large-matrix multiplies using Strassen's algorithm would result in a workload with more than 1 addition per multiply, if you can arrange it such that the adding work overlaps with multiplying.
Or possibly run another thread on the same physical core doing the adding work, if you can arrange that without making things worse by competing for L1d cache footprint and bandwidth. HPC workloads sometimes scale negatively with SMT / hyperthreading for that reason, but partly that's because a well tuned single thread can use all the FP throughput from a single core. But if that's not the case on Zen 4, there's some theoretical room for gains.
However, that would require your FMA code to need less than 1 load per FMA, otherwise load/store uops will be the bottleneck if a submatrix-add thread is trying to load+load+add+store at the same time as a submatrix-multiply thread is doing 2 loads + 2 FMAs per clock.
For example, some workstation xeons had 2x fpu of 512bits each but no dedicated "add"s?
And yes, Intel CPUs with a second 512-bit FMA unit (like some Xeon scalable processors) can sustain 2x 512-bit FMAs per clock if you optimize your code well enough (e.g. not bottlenecking on loads+stores or FMA latency), so that gives you 2x 16 single-precision FMAs = 64 FLOP/cycle.
Alder Lake / Sapphire Rapids re-added separate execution units for FP-add, but they're on the same ports as the FMA units, so the benefit is lower latency for things that bottleneck on the latency of separate vaddps
/ vaddpd
, like in Haswell. (But unlike Haswell, there are two of them, so the throughput is still 2/clock.)