Why read the id register in __turn_mmu_on?

After a few days of reading the ARM Linux kernel boot code, I understand most of it, except for a couple of tricky parts in the function __turn_mmu_on:

        .align  5
    __turn_mmu_on:
        mov     r0, r0
        mcr     p15, 0, r0, c1, c0, 0       @ write control reg
        mrc     p15, 0, r3, c0, c0, 0       @ read id reg
        mov     r3, r3
        mov     r3, r3
        mov     pc, r13
    ENDPROC(__turn_mmu_on)

The last instruction, mov pc, r13, branches to __mmap_switched, which begins as follows:

    __mmap_switched:
        adr r3, __switch_data + 4
        ....

Why is it necessary to align the function on a 32-byte boundary (the size of a cache line)?
What is the purpose of reading the ID register when its value is never used? Register r3 is simply overwritten by the instruction adr r3, __switch_data + 4.

Solution

Alignment is probably not strictly required. It is most likely there to ensure that the whole function fits into a single cache line, so that the last few instructions are executed from the cache and do not have to be fetched from memory (even though the function should remain at the same address with the MMU on, because it is identity-mapped).
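To make the cache-line argument concrete, here is a small C sketch of the arithmetic (it is not kernel code). It assumes 32-byte cache lines, matching .align 5, and 4-byte ARM-mode instructions, so the six instructions quoted in the question occupy 24 bytes. The helper name fits_in_one_line and the example addresses are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u  /* assumption: 32-byte lines, i.e. 2^5 as in .align 5 */
#define INSN_BYTES        4u  /* ARM-mode instructions are 4 bytes each */
#define NUM_INSNS         6u  /* instructions in __turn_mmu_on as quoted above */

/* Returns 1 if the byte range [start, start + len) lies within one cache line. */
static int fits_in_one_line(uint32_t start, uint32_t len)
{
    assert(len > 0u);
    uint32_t first_line = start / CACHE_LINE_BYTES;
    uint32_t last_line  = (start + len - 1u) / CACHE_LINE_BYTES;
    return first_line == last_line;
}

/* Size of the function body in bytes: 6 * 4 = 24, which is less than 32. */
static uint32_t turn_mmu_on_size(void)
{
    return NUM_INSNS * INSN_BYTES;
}
```

With a 32-byte-aligned start address such as 0x8000, the 24-byte body stays inside one line. With a start address of 0x8010 it would straddle two lines, which is the situation the alignment directive rules out.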

It was not easy to track down the origin of the MRC instruction, but I think I found it:

Date: 2004-04-04 04:35 +200
To: linux-arm-patches
Subject: [Linux-arm-patches] 1204.1: XSCALE processor stalls when enabling MMU
--- kernel-source-2.5.21-rmk/arch/arm/kernel/head.S    Sun Jun  9 07:26:29 2002
+++ kernel-2.5.21-was/arch/arm/kernel/head.S    Fri Jul 12 20:41:42 2002
@@ -118,9 +118,7 @@ __turn_mmu_on:
     orr    r0, r0, #2            @ ...........A.
 #endif
     mcr    p15, 0, r0, c1, c0
-    mov    r0, r0
-    mov    r0, r0
-    mov    r0, r0
+    cpwait    r10
     mov    pc, lr
[...]
+/*
+ * cpwait - wait for coprocessor operation to finish
+ * this is the canonical way to wait for cp updates
+ * on PXA2x0 as proposed by Intel
+ */
+    .macro    cpwait reg
+    mrc    p15, 0, \reg, c2, c0, 0    @ arbitrary cp reg read
+    mov    r0, r0                    @ nop
+    sub    pc, pc, #4                @ nop
+    .endm

The ensuing discussion on the merits of this patch led to the current approach:

...
We can however get closer to the Xscale recommended sequence by knowing how things work on other CPUs, and knowing what we're doing here. If we insert the following instruction after the mcr, then this should solve your issue.
mrc p15, 0, r0, c1, c0
Since the read-back of the same register is guaranteed by the ARM architecture manual to return the value that was written there (if it doesn't, the CPU isn't an ARM compliant implementation), this means we can guarantee that the write to the register has taken effect. The use of the "mov r0, r0" instructions are the same as in the CPWAIT macro. The mov pc, lr is equivalent to the "sub pc, pc, #4" (they are defined to be the same class of instructions), so merely adding one instruction should guarantee that the Xscale works as expected.
...

The original patch was from Lothar Wassmann; the final code is probably by Russell King.