Does it take the same time to load 16 bits or 128 bits from RAM?

Modern x86 CPUs with SSE and AVX/AVX2 have tons of registers.

Table of registers

If I decide to use some of the biggest registers (> 128 bits), will my program slow down? Why?

I can't find a definitive answer. If I understand correctly, depending on the model, the CPU fetches a fixed amount of data from RAM on each access (64 or 128 bits), but you only use the bits you asked for. Is that right?

If possible, apply your explanation to this example:

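mov al, 0xFF                      ; 8-bit ns=??
mov ax, 0xFFFF                    ; 16-bit ns=??
mov eax, 0xAABBAABB               ; 32-bit ns=??
mov rax, 0xAABBCCDDAABBCCDD       ; 64-bit ns=??
movdqa xmm0, ...                  ; 128-bit (SSE2); plain mov cannot target xmm registers
vmovdqa64 zmm0, [variable512bit]  ; 512-bit aligned load (AVX-512)
; and the opposite
vmovdqa64 [variable512bit], zmm0  ; 512-bit aligned store (AVX-512)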

Solution

The time required to fill a register from the L1 cache depends on the processor-L1 cache interface. The width of that interface is usually equal to or smaller than a cache line. On Nehalem, for example, you can load 16 bytes per cycle even though the cache line is 64 bytes wide. Take a look here for some numbers for different architectures.

To answer your question under the assumption of an L1 hit: as long as the register size is equal to or smaller than the width of the processor-L1 interface, using a wider register does not slow down your code. Remember, though, that if your access is not aligned and straddles a cache-line boundary, it takes two accesses to get the data, and that does slow down your code.
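The alignment point can be checked with a little address arithmetic. The following is only a sketch: it assumes a 64-byte cache line, and `crosses_cache_line` and `CACHE_LINE_BYTES` are hypothetical names made up for illustration. It reports whether an access of a given size starting at a given address touches two cache lines instead of one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes; 64 is typical but not universal. */
#define CACHE_LINE_BYTES 64u

/* Returns 1 if an access of `size` bytes starting at `addr` touches
   two cache lines, 0 if it fits entirely inside one line. */
static int crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE_BYTES);
    uintptr_t first_line = addr / CACHE_LINE_BYTES;              /* line of first byte */
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_BYTES; /* line of last byte  */
    return first_line != last_line;
}
```

For instance, a 16-byte load at offset 48 stays inside one line, while an 8-byte load at offset 60 spills into the next line and would need two accesses.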

In case of a cache miss, the memory interface dictates your code's performance. Note that memory bandwidth is much lower than cache bandwidth.

SIMD registers (like those of SSE and AVX) can be wider than the processor-L1 interface, in which case filling one takes more than one transfer.
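As a rough back-of-the-envelope model of the above, the number of transfers needed to fill a register on an L1 hit is the register width divided by the interface width, rounded up. This is a simplified sketch, not a description of any specific CPU: `l1_transfers` is a hypothetical helper, and it ignores pipelining, multiple load ports, and alignment effects.

```c
#include <assert.h>

/* Number of interface-wide transfers needed to fill a register of
   `reg_bytes` over a processor-L1 path that moves `iface_bytes` per
   transfer, i.e. ceil(reg_bytes / iface_bytes). */
static unsigned l1_transfers(unsigned reg_bytes, unsigned iface_bytes)
{
    assert(reg_bytes > 0 && iface_bytes > 0);
    return (reg_bytes + iface_bytes - 1) / iface_bytes; /* integer ceiling */
}
```

With the 16-byte interface quoted for Nehalem, `l1_transfers(8, 16)` and `l1_transfers(16, 16)` both give 1, so a 64-bit or 128-bit register costs the same, while a hypothetical 32-byte register over the same path would give 2.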