I have read that when the CPU reads from memory, it reads a word of memory at a time (like 4 bytes or 8 bytes). How can the CPU execute something like:
mov BYTE PTR [rbp-20], al
where it copies only one byte of data from al to the stack (given that the data bus is 64 bits wide)? It would be great if anyone could provide information on how this is implemented at the hardware level.
Also, as we all know, when the CPU executes a program it has a program counter or instruction pointer that holds the address of the next instruction, and the control unit fetches that instruction into the memory data register to execute it later. Let's say:
0: b8 00 00 00 00 mov eax,0x0
is 5 bytes long (on x86), and
0: 31 c0 xor eax,eax
is 2 bytes long; instructions have varying lengths.
If the control unit wants to fetch these instructions, how does it know how many bytes to fetch?
And what about instructions like:
0: 48 b8 5c 8f c2 f5 28 movabs rax,0x28f5c28f5c28f5c
7: 5c 8f 02
which exceed the word size? How are they handled by the CPU?
x86 is not a word-oriented architecture at all. Instructions are variable length with no alignment.
"Word size" is not a meaningful term on x86; some people may use it to refer to the register width, but instruction fetch / decode has nothing to do with the integer registers.
In practice on most modern x86 CPUs, instruction fetch from the L1 instruction cache happens in aligned 16-byte or 32-byte fetch blocks. Later pipeline stages find instruction boundaries and decode up to 5 instructions in parallel (e.g. Skylake). See David Kanter's write-up of Haswell for a block diagram of the front-end showing 16-byte instruction fetch from L1i cache.
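To get a feel for what that boundary-finding step involves, here is a toy software model (my own sketch; real pre-decode hardware examines all the bytes in parallel with dedicated logic and knows the full, much messier, x86 encoding rules) that walks a 16-byte fetch block and marks instruction boundaries, handling only the three encodings from the question:

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Toy length decoder: knows only the three encodings in the question. */
static int insn_length(const uint8_t *p)
{
    if (p[0] == 0x48 && p[1] == 0xb8) return 10; /* REX.W + mov rax, imm64 */
    if (p[0] == 0xb8)                 return 5;  /* mov eax, imm32         */
    if (p[0] == 0x31)                 return 2;  /* xor r/m32, r32 (ModRM) */
    return -1;                                   /* anything else: bail    */
}

int main(void)
{
    /* One aligned 16-byte fetch block holding the question's three
     * instructions; the 10-byte movabs does not fit entirely. */
    uint8_t block[16] = {
        0xb8, 0x00, 0x00, 0x00, 0x00,             /* mov  eax,0x0          */
        0x31, 0xc0,                               /* xor  eax,eax          */
        0x48, 0xb8, 0x5c, 0x8f, 0xc2,             /* movabs rax,imm64 ...  */
        0xf5, 0x28, 0x5c, 0x8f,                   /* ... 9 of its 10 bytes */
    };

    size_t pos = 0;
    while (pos < sizeof block) {
        int len = insn_length(&block[pos]);
        if (len < 0) {
            printf("offset %zu: opcode this toy model doesn't know\n", pos);
            break;
        }
        if (pos + (size_t)len > sizeof block) {
            printf("offset %zu: instruction spans into the next fetch block\n", pos);
            break;
        }
        printf("instruction at offset %zu, %d bytes\n", pos, len);
        pos += (size_t)len;
    }
    return 0;
}

The movabs at the end straddles the block boundary, so this model has to stop until more bytes arrive; real front-ends buffer leftover bytes and stitch them together with the next fetch block.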
But note that modern x86 CPUs also use a decoded-uop cache so they don't have to deal with the hard-to-decode x86 machine code for code that runs very frequently (e.g. inside a loop, even a large loop). Dealing with variable-length unaligned instructions is a significant bottleneck on older CPUs.
See Can modern x86 hardware not store a single byte to memory? for more about how the cache absorbs stores to normal memory regions (MTRR and/or PAT set to WB = Write-Back memory type).
The logic that commits stores from the store buffer to L1 data cache on modern Intel CPUs handles any store of any width as long as it's fully contained within one 64-byte cache line.
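As a rough sketch of that containment condition (my own illustration with a made-up fits_in_one_line helper; the actual commit logic is dedicated hardware, not software):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A store can commit in one shot if its first and last byte fall in
 * the same 64-byte line; width and alignment within the line are
 * otherwise irrelevant. */
static bool fits_in_one_line(uint64_t addr, unsigned width)
{
    return (addr / 64) == ((addr + width - 1) / 64);
}

int main(void)
{
    printf("%d\n", fits_in_one_line(0x1003, 1)); /* 1: byte store, mid-line  */
    printf("%d\n", fits_in_one_line(0x1038, 8)); /* 1: ends at the line edge */
    printf("%d\n", fits_in_one_line(0x103e, 4)); /* 0: straddles two lines   */
    return 0;
}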
Non-x86 CPUs that are more word-oriented (like ARM) commonly use a read-modify-write of a cache word (4 or 8 bytes) to handle narrow stores. See Are there any modern CPUs where a cached byte store is actually slower than a word store? But modern x86 CPUs do spend the transistors to make cached byte stores or unaligned wider stores exactly as efficient as aligned 8-byte stores into cache.
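Here is a minimal software model of that read-modify-write, assuming little-endian byte numbering; byte_store_via_rmw is a name I made up, and this only illustrates the merge, not any particular CPU's implementation:

#include <stdint.h>
#include <stdio.h>

/* Read the containing 8-byte word, splice in the new byte, write the
 * whole word back: the internal read-modify-write a word-oriented
 * cache might perform for a 1-byte store. */
static void byte_store_via_rmw(uint64_t *word, unsigned byte_in_word, uint8_t value)
{
    uint64_t mask = (uint64_t)0xff << (8 * byte_in_word);
    uint64_t old  = *word;                                          /* read           */
    *word = (old & ~mask) | ((uint64_t)value << (8 * byte_in_word)); /* modify + write */
}

int main(void)
{
    uint64_t w = 0x1111111111111111;
    byte_store_via_rmw(&w, 2, 0xab);            /* store 0xab to byte 2 */
    printf("%016llx\n", (unsigned long long)w); /* 1111111111ab1111     */
    return 0;
}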
given that the data bus is 64 bits wide
Modern x86 has memory controllers built into the CPU. That DDR[1234] SDRAM bus has 64 data lines, but a single read or write command initiates a burst of 8 transfers, transferring 64 bytes of data. (Not coincidentally, 64 bytes is the cache line size for all existing x86 CPUs.)
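Restating that arithmetic in code form (the constant names are mine):

/* 64 data lines = 8 bytes per transfer ("beat"); a burst of 8 beats
 * moves 8 * 8 = 64 bytes, i.e. exactly one cache line per command. */
enum {
    BYTES_PER_BEAT  = 64 / 8,   /* 64 data lines / 8 bits per byte */
    BURST_LENGTH    = 8,
    BYTES_PER_BURST = BYTES_PER_BEAT * BURST_LENGTH,
};
_Static_assert(BYTES_PER_BURST == 64, "one burst moves one cache line");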
For a store to an uncacheable memory region (i.e. if the CPU is configured to treat that address as uncacheable even though it's backed by DRAM), a single-byte or other narrow store is possible using the DQM byte-mask signals which tell the DRAM memory which of the 8 bytes are actually to be stored from this burst transfer.
(Or if that's not supported (which may be the case), the memory controller may have to read the old contents and merge, then store the whole line. Either way, 4-byte or 8-byte chunks are not the significant unit here. DDR burst transfers can be cut short, but only to 32 bytes down from 64. I don't think an 8-byte aligned write is actually very special at the DRAM level. It is guaranteed to be "atomic" in the x86 ISA, though, even on uncacheable MMIO regions.)
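A sketch of what such a byte mask looks like (byte_enable_mask is a made-up name; real DQM signaling is per-transfer pin state, not a computed integer):

#include <stdint.h>
#include <stdio.h>

/* Byte-enable mask for a narrow store within one 8-byte beat: each
 * set bit enables one of the 8 bytes on the bus (the job of the
 * DQM/DM pins); clear bits leave those DRAM bytes untouched.
 * Assumes the store does not cross an 8-byte boundary. */
static uint8_t byte_enable_mask(uint64_t addr, unsigned width)
{
    unsigned offset = addr % 8;   /* byte position within the beat */
    return (uint8_t)(((1u << width) - 1u) << offset);
}

int main(void)
{
    printf("0x%02x\n", byte_enable_mask(0x1003, 1)); /* 0x08: only byte 3 */
    printf("0x%02x\n", byte_enable_mask(0x1004, 4)); /* 0xf0: bytes 4..7  */
    return 0;
}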
A store to an uncacheable MMIO region will result in a PCIe transaction of the appropriate size, up to 64 bytes.
Inside the CPU core, the bus between data cache and execution units can be 32 or 64 bytes wide (or 16 bytes on current AMD). And transfers of cache lines between L1d and L2 cache are also done over a 64-byte-wide bus, on Haswell and later.