assembly cpu-architecture avr micro-architecture

How is it possible that the AVR microarchitecture can fetch 2 operands from the GP-Register to the ALU in only 1 clock cycle?

According to the AVR microcontroller datasheets, as well as the AVR instruction set manual, certain instructions, for example ADD, can fetch 2 operands from the GP-Registers to the ALU in only 1 clock cycle. The instruction word for ADD includes 2 GP-Register addresses, each 5 bits wide: one for the destination/source and one for the source. But how is this implemented at the hardware level? Wouldn't the two 5-bit register addresses interfere with each other as they try to access the GP-Registers through the same direct addressing bus?
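To make the encoding concrete, here is a minimal sketch that splits an ADD instruction word into its two register fields. It assumes the documented ADD bit layout `0000 11rd dddd rrrr`; the function name `decode_add` is made up for illustration. Note that the two fields occupy different bits of the instruction word, so both addresses are available at the same time.

```python
def decode_add(opcode):
    """Split a 16-bit AVR ADD opcode (0000 11rd dddd rrrr) into (Rd, Rr)."""
    # The top six bits must be 000011 for ADD.
    if opcode & 0xFC00 != 0x0C00:
        raise ValueError("not an ADD opcode")
    # Rd is the contiguous 5-bit field in bits 8..4.
    rd = (opcode >> 4) & 0x1F
    # Rr is split: its high bit sits in bit 9, its low four bits in bits 3..0.
    rr = ((opcode >> 5) & 0x10) | (opcode & 0x0F)
    return rd, rr


print(decode_add(0x0F01))  # add r16, r17 -> prints (16, 17)
```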

Solution

Multi-ported register files are widely used in CPU designs. As Wikipedia says:

Modern integrated circuit-based register files are usually implemented by way of fast static RAMs with multiple ports.

A quick Google search turns up slides with gate- and transistor-level details of a multi-ported SRAM cell, and block diagrams of how to build a register file out of such cells.

This is not at all unique to AVR. Pipelined RISC CPUs in general are designed around executing (at least) 1 instruction per clock when nothing stalls, with the register file handling 2 reads + 1 write per cycle, e.g. MIPS and other classic 5-stage RISC pipelines. AVR is just an 8-bit version of those ideas.
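As a rough behavioral sketch (not a description of AVR's actual circuit), a register file with 2 read ports and 1 write port can be modeled like this. The class and function names are hypothetical. The point is that each port has its own address input, so the two 5-bit register numbers never share a bus.

```python
class RegisterFile:
    """Behavioral model of a 32 x 8-bit register file: 2 read ports, 1 write port."""

    def __init__(self, count=32):
        self.regs = [0] * count

    def cycle(self, read_a, read_b, write_addr=None, write_data=0):
        # Each read port is driven by its own address, so both lookups
        # happen in the same cycle without interfering.
        a = self.regs[read_a]
        b = self.regs[read_b]
        # The write port commits at the end of the cycle.
        if write_addr is not None:
            self.regs[write_addr] = write_data & 0xFF
        return a, b


def execute_add(rf, d, r):
    """Model ADD Rd, Rr as two reads, an 8-bit add, and one write-back."""
    a, b = rf.cycle(d, r)
    rf.cycle(d, r, write_addr=d, write_data=a + b)


rf = RegisterFile()
rf.regs[16], rf.regs[17] = 200, 100
execute_add(rf, 16, 17)
print(rf.regs[16])  # 300 wraps to 8 bits -> prints 44
```

For simplicity the model calls `cycle` twice; in hardware the reads, the ALU operation, and the write-back can all fit in one clock period.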

@Margaret Bloom pointed out that multi-ported register files aren't the only implementation strategy. Given AVR's slow clock speed, the register file could be single-ported and simply clocked faster than the rest of the core, so that it performs several accesses within one CPU clock cycle.
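That alternative can be sketched the same way. This is purely illustrative: the three-phase split and all names are assumptions, not something documented for AVR. A single port is used three times within what the rest of the CPU sees as one cycle.

```python
class SinglePortRegisterFile:
    """Behavioral model of a register file with one port, used once per phase."""

    def __init__(self, count=32):
        self.regs = [0] * count
        self.port_uses = 0  # number of accesses made through the single port

    def access(self, addr, write_data=None):
        # Only one address can be presented per access.
        self.port_uses += 1
        if write_data is None:
            return self.regs[addr]
        self.regs[addr] = write_data & 0xFF
        return None


def add_in_one_cpu_cycle(rf, d, r):
    """Model ADD Rd, Rr as three back-to-back accesses within one CPU cycle."""
    a = rf.access(d)      # phase 1: read Rd
    b = rf.access(r)      # phase 2: read Rr
    rf.access(d, a + b)   # phase 3: write the sum back to Rd


sp = SinglePortRegisterFile()
sp.regs[16], sp.regs[17] = 3, 4
add_in_one_cpu_cycle(sp, 16, 17)
print(sp.regs[16], sp.port_uses)  # prints 7 3
```

The trade-off is that the register array has to run several times faster than the CPU clock, which is only plausible when that clock is slow.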

Modern superscalar CPUs have register files with even more ports.

For example, Intel Skylake can sustain a throughput of reading at least 7 GP-integer registers per clock cycle while writing 3 registers in the same clock cycle (https://www.agner.org/optimize/blog/read.php?i=857). (It also writes FLAGS 3 times, thanks to register renaming breaking the WAW (write-after-write) hazard, although this doesn't really count as separate: uops that produce a register and a FLAGS output can use the same physical register entry to hold both. The RAT keeps track of what comes from where.)

(Different loops can easily write 4 registers per clock cycle on modern Intel; the experiment I linked was mainly testing how many register reads I could get per clock, and unfused-domain uop throughput.)