How do modern cpus handle crosspage unaligned access?

I'm trying to understand how unaligned memory access (UMA) works on modern processors (namely x86-64 and ARM architectures). I get that I might run into problems with UMA ranging from performance degradation till CPU fault. And I read about posix_memalign and cache lines.

What I cannot find is how the modern systems/hardware handle the situation when my request exceeds page boundaries?

Here is an example:

I malloc() an 8KB chunk of memory.
Let's say that malloc() doesn't have enough memory and sbrk()s 8KB for me.
The kernel gets two memory pages (4KB each) and maps them into my process's virtual address space (let's say that these two pages are not one after another in memory
movq (offset + $0xffc), %rax I request 8 bytes starting at the 4092th byte, meaning that I want 4 bytes from the end of the first page and 4 bytes from the beginning of the second page.

Physical memory:

---|---------------|---------------|-->
   |... 4b|        |        |4b ...|-->

I need 8 bytes that are split at the page boundaries.

How do MMU on x86-64 and ARM handle this? Are there any mechanisms in kernel MM to somehow prepare for this kind of request? Is there some kind of protection in malloc? What do processors do? Do they fetch two pages?

I mean to complete such request MMU has to translate one virtual address to two physical addresses. How does it handle such request?

Should I care about such things if I'm a software programmer and why?

I'm reading a lot of links from google, SO, drepper's cpumemory.pdf and gorman's Linux VMM book at the moment. But it's an ocean of information. It would be great if you at least provide me with some pointers or keywords that I could use.

Solution

I'm not overly familiar with the guts of the Intel architecture, but the ARM architecture sums this specific detail up in a single bullet point under "Unaligned data access restrictions":

An operation that performs an unaligned access can abort on any memory access that it makes, and can abort on more than one access. This means that an unaligned access that occurs across a page boundary can generate an abort on either side of the boundary.

So other than the potential to generate two page faults from a single operation, it's just another unaligned access. Of course, that still assumes all the caveats of "just another unaligned access" - namely it's only valid on normal (not device) memory, only for certain load/store instructions, has no guarantee of atomicity and may be slow - the microarchitecture will likely synthesise an unaligned access out of multiple aligned accesses¹, which means multiple MMU translations, potentially multiple cache misses if it crosses a line boundary, etc.

Looking at it the other way, if an unaligned access doesn't cross a page boundary, all that means is that if the aligned address for the first "sub-access" translates OK, the aligned addresses of any subsequent parts are sure to hit in the TLB. The MMU itself doesn't care - it just translates some addresses that the processor gives it. The kernel doesn't even come into the picture unless the MMU raises a page fault, and even then it's no different from any other page fault.

I've had a quick skim through the Intel manuals and their answer hasn't jumped out at me - however in the "Data Types" chapter they do state:

[...] the processor requires two memory accesses to make an unaligned access; aligned accesses require only one memory access.

so I'd be surprised if wasn't broadly the same (i.e. one translation per aligned access).

Now, this is something most application-level programmers shouldn't have to worry about, provided they behave themselves - outside of assembly language, it's actually quite hard to make unaligned accesses happen. The likely culprits are type-punning pointers and messing with structure packing, both things that 99% of the time one has no reason to go near, and for the other 1% are still almost certainly the wrong thing to do.

[1] The ARM architecture pseudocode actually specifies unaligned accesses as a series of individual byte accesses, but I'd expect implementations actually optimise this into larger aligned accesses where appropriate.