Most Intel processors have 2 load units and 1 store unit. Is the store unit also a load unit? Do instructions / micro-ops that modify existing memory data, like `inc [memory]`, use just 1 store unit, leaving both load units free for other micro-ops/instructions that execute in the same cycles? Or does an instruction like `inc` take 1 load unit (to load the existing value) plus 1 store unit (to store the new value), so only one load unit is left available? In other words, to keep both load units available, should we stick to pure store instructions like `mov`, `push`, etc.?
A memory read-modify-write instruction is at least 4 unfused-domain uops on Intel P6-family or Sandybridge-family CPUs. (It could be more if it needs more than 1 ALU uop.)
There's no requirement that any of them execute in the same cycle, which the wording of your question seems to be assuming. Allowing out-of-order execution to do other work during the load-use latency is one of the major benefits of decoding x86 instructions into internal RISC-like uops.
You can see more details in Agner Fog's instruction tables. Consult his microarchitecture pdf to learn more about what that means. For anything I don't explain in this answer, you can find details in there.
For `inc dword [rdi]` on Intel Haswell, these are the uops (and the ports they can run on):

- load: `dword [rdi]` (p2/p3), depends on `rdi`
- ALU: `inc` (p0/p1/p5/p6), dependent on data from the load
- store-address: `[rdi]` (p2/p3/p7), depends on `rdi` (but not on the loaded data, I think)
- store-data (p4), depends on the result of the ALU uop

Note that only simple addressing modes (`[reg]` or `[reg + constant]`) can use the store AGU on port 7, but such store-address uops can still be sent to p2 or p3 and steal load throughput. Other store addressing modes can only use p2/p3. Load uops go to p2 or p3, and use the AGU but also the load-data part of the execution unit.
This imperfect scheduling can and does impact sustained L1D bandwidth: Intel's optimization manual suggests that although the peak L1D bandwidth in Skylake-S is 64B read and 32B written in a single cycle, the sustained bandwidth is at best ~81B per cycle. (Table 2-4. Cache Parameters of the Skylake Microarchitecture on page 36)
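As a quick sanity check on those numbers (a back-of-the-envelope calculation of mine, not from the manual):

```python
# Back-of-the-envelope check on the Skylake-S L1d numbers quoted above.
peak_bytes_per_cycle = 64 + 32   # 64B read + 32B written in the same cycle
sustained = 81                   # ~81B/cycle sustained, per Intel's manual

# Imperfect scheduling of load/store uops onto ports costs ~16% of peak.
print(round(sustained / peak_bytes_per_cycle, 2))  # -> 0.84
```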
`inc [mem]` definitely has to run a load uop. See *Can num++ be atomic for 'int num'?* for more details of how read-modify-write operations work (with/without a `lock` prefix). The CPU can't just send an "increment" command to DRAM or cache and have the operation happen "in memory".
Counting uops vs. ports makes more sense for throughput in a loop, or a long sequence of code. You can't know which uops will execute in the same cycle as each other, unless they were both waiting for the same input to become ready. Then you can predict that the oldest uop will go first if there aren't enough execution ports for the uops to run in parallel (this is called a resource conflict). So it may be better to put instructions on the critical path first, to reduce latency from resource conflicts.
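The oldest-ready-first rule can be sketched with a toy model (my own simplification; the real scheduler is considerably more complex):

```python
# Toy model of a resource conflict: several uops wake up in the same cycle
# (their shared input just became ready) but all need the same port.
# Intel's schedulers pick oldest-ready first, so younger uops wait.

def pick_order(ready):
    """ready: list of (issue_order, name) uops contending for one port.
    Returns the cycle-by-cycle execution order, oldest first."""
    return [name for _, name in sorted(ready)]

# uop "A" issued before "B"; both become ready together, one port free.
assert pick_order([(1, "B"), (0, "A")]) == ["A", "B"]
# "B" loses the conflict and runs a cycle later, adding latency if it was
# on the critical path -- hence the advice to schedule critical-path
# instructions first.
```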
Execution-port bottlenecks on a specific port are only one of the three common uop-throughput limits. The other two are:

- front-end bandwidth: total fused-domain uops issued per clock (e.g. 4 per clock on Haswell/Skylake)
- latency: dependency chains, especially loop-carried ones
So, other than cache misses and branch mispredicts, the impact of a sequence of instructions on surrounding independent work can be roughly characterized by its latency, fused-domain uop count, and uops for each port.
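That characterization can be turned into a rough cycles-per-iteration estimate (my simplification, with hypothetical uop counts, not measurements of any real loop):

```python
# Rough throughput model for a loop body: cycles/iteration is limited by
# the worst of the three bottlenecks (front end, busiest port, latency).

def min_cycles_per_iter(fused_uops, uops_per_port, dep_chain_latency,
                        issue_width=4):
    """uops_per_port: unfused uops bound to each port per iteration.
    issue_width=4 is the fused-domain issue width of Haswell/Skylake."""
    front_end = fused_uops / issue_width
    busiest_port = max(uops_per_port.values())
    return max(front_end, busiest_port, dep_chain_latency)

# Hypothetical loop body: 3 fused-domain uops, one unfused uop on each of
# four ports, and a 1-cycle loop-carried dependency chain.
cycles = min_cycles_per_iter(3, {"p23": 1, "p0156": 1, "p237": 1, "p4": 1}, 1)
assert cycles == 1.0  # port pressure and latency tie; front end is not the limit
```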
To save front-end decode and issue bandwidth, some of these uops can micro-fuse together. (On AMD CPUs, they were never split apart in the first place until they reach the execution units.) See also Micro fusion and addressing modes for more about micro-fusion. (I have an unfinished update for that answer which adds a stand-alone description of micro-fusion to put everything in one place, since Agner Fog's guide omits un-lamination and Intel's optimization manual doesn't mention that HSW and later don't always un-laminate in cases where SnB did.)
`inc dword [rsi]` can only fuse the store-address + store-data uops together on Sandybridge-family, so it decodes to 3 fused-domain uops. `add dword [rsi], 1` can also fuse the load with an ALU `add` uop, so it's only 2 total fused-domain uops for the issue stage to read from the IDQ and add to the ROB. It still expands to 4 unfused-domain uops to be assigned to ports and added to the scheduler (aka Reservation Station). (Yes, uops are assigned to ports at issue time in Intel CPUs.)
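Counting both domains for the two instructions discussed here, as a worked example:

```python
# Fused- vs unfused-domain uop counts from the discussion above
# (Sandybridge-family).
uop_counts = {
    "inc dword [rsi]":    {"fused": 3, "unfused": 4},  # only the store fuses
    "add dword [rsi], 1": {"fused": 2, "unfused": 4},  # load+ALU fuses too
}

# Both expand to the same 4 unfused-domain uops at the execution ports;
# `add` only saves front-end (issue/ROB) bandwidth.
assert all(v["unfused"] == 4 for v in uop_counts.values())
assert uop_counts["add dword [rsi], 1"]["fused"] == 2
```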
Note that `add` sets flags differently from `inc` (`add` writes CF, while `inc` leaves it unmodified), so they can't decode to exactly the same kind of internal uop. Presumably Intel decided it was worth it to let `add` uops fuse with loads, because instructions like `add eax, [rsi]` are common. But `inc` + load fusion could only ever happen as part of `inc [mem]`.