The gcc docs talk about a difference between prefetch for read and prefetch for write. What is the technical difference?
On the CPU level, a software prefetch (as opposed to one triggered by the hardware itself) is a convenient way to hint to the CPU that a line is about to be accessed and that you want it fetched in advance to hide the latency.
If the access will be a simple read, you would want a regular prefetch, which behaves similarly to a normal load from memory (aside from not blocking the CPU if it misses, not faulting if the address is bad, and various other benefits, depending on the microarchitecture).
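As a rough sketch of what a read prefetch looks like in practice, the loop below hints a line a fixed distance ahead of the element it is about to read. The function name and the READ_AHEAD distance are made up for illustration, and a plain linear scan like this is usually handled well by the hardware prefetcher anyway; the pattern matters more for irregular accesses.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
#define READ_AHEAD 16

/* Sum an array, hinting each upcoming element as a future read. */
long long sum_with_read_prefetch(const long long *data, size_t n)
{
    long long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + READ_AHEAD < n) {
            /* 2nd argument 0 = read, 3rd argument 3 = high temporal locality */
            __builtin_prefetch(&data[i + READ_AHEAD], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same with or without it; only the timing may change.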
However, if you intend to write to that line, and it also exists in another core, a simple read operation would not suffice. This is due to MESI-based cache handling protocols. A core must have ownership of a line before modifying it, so that it preserves coherency (if the same line gets modified in multiple cores, you will not be able to ensure correct ordering for these changes, and may even lose some of them, which is not allowed on normal WB memory types). Instead, a write operation will start by acquiring ownership of the line, and snooping it out of any other core / socket that may hold a copy. Only then can the write occur. A read operation (demand or prefetch) would have left the line in other cores in a shared state, which is good if the line is read multiple times by many cores, but doesn't help you if your core later writes to it.
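To make the ownership argument concrete, here is a toy model of a single cache line shared by two cores. It is only a sketch of MESI-style state changes under simplifying assumptions (two cores, no timing, no write-back traffic), and all names in it are invented for illustration; real protocols and the exact state a write prefetch leaves the line in vary by microarchitecture.

```c
/* Toy two-core model of one cache line; not a real coherence implementation. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } line_state;

typedef struct {
    line_state core[2];   /* state of the line in each core's cache */
    int snoops;           /* invalidations sent to the other core */
} line_model;

/* Read (demand or read prefetch) by core c. */
void model_read(line_model *m, int c)
{
    int other = 1 - c;
    if (m->core[c] != INVALID)
        return;                    /* hit: state unchanged */
    if (m->core[other] == INVALID) {
        m->core[c] = EXCLUSIVE;    /* nobody else holds the line */
    } else {
        m->core[other] = SHARED;   /* the other copy is downgraded to shared */
        m->core[c] = SHARED;
    }
}

/* Acquire ownership: a write prefetch, or the first step of a store. */
void model_own(line_model *m, int c)
{
    int other = 1 - c;
    if (m->core[other] != INVALID) {
        m->core[other] = INVALID;  /* snoop the line out of the other core */
        m->snoops++;
    }
    if (m->core[c] != MODIFIED)
        m->core[c] = EXCLUSIVE;
}

/* Store by core c; returns 1 if ownership had to be acquired first. */
int model_write(line_model *m, int c)
{
    int had_to_acquire =
        (m->core[c] != EXCLUSIVE && m->core[c] != MODIFIED);
    if (had_to_acquire)
        model_own(m, c);
    m->core[c] = MODIFIED;
    return had_to_acquire;
}
```

In this model, calling model_read before model_write leaves the line shared, so the store still has to acquire ownership, whereas calling model_own first moves that cost ahead of the store.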
To allow useful prefetching for lines that will later be written to, most CPU vendors support special prefetches for writing. On x86, both Intel and AMD support the prefetchW instruction, which should have the effect of a write (i.e. acquiring sole ownership of a line and invalidating any other copy of it). Note that not all CPUs support it (even within the same family, not all generations have it), and not all compiler versions enable it.
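From C, a write prefetch is requested by passing 1 as the second argument of __builtin_prefetch. The loop below is a minimal sketch of that usage; the function name and the WRITE_AHEAD distance are hypothetical, and whether the compiler actually emits prefetchW depends on the target flags.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
#define WRITE_AHEAD 16

/* Increment every element, hinting each upcoming element as a future write. */
void increment_all(long long *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + WRITE_AHEAD < n) {
            /* 2nd argument 1 = the line is about to be written */
            __builtin_prefetch(&data[i + WRITE_AHEAD], 1, 3);
        }
        data[i] += 1;
    }
}
```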
Here's an example (with gcc 4.8.2). Note that you need to enable the instruction explicitly here:
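#include <emmintrin.h>

int main() {
    long long int a[100];
    /* __builtin_prefetch(addr, rw, locality): rw 0 = read, 1 = write */
    __builtin_prefetch (&a[0], 0, 0);   /* read, locality 0 */
    __builtin_prefetch (&a[16], 0, 1);  /* read, locality 1 */
    __builtin_prefetch (&a[32], 0, 2);  /* read, locality 2 */
    __builtin_prefetch (&a[48], 0, 3);  /* read, locality 3 */
    __builtin_prefetch (&a[64], 1, 0);  /* write */
    return 0;
}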
Compiled with gcc -O3 -mprfchw prefetchw.c -c, the disassembly is:
0000000000000000 <main>:
0: 48 81 ec b0 02 00 00 sub $0x2b0,%rsp
7: 48 8d 44 24 88 lea -0x78(%rsp),%rax
c: 0f 18 00 prefetchnta (%rax)
f: 0f 18 98 80 00 00 00 prefetcht2 0x80(%rax)
16: 0f 18 90 00 01 00 00 prefetcht1 0x100(%rax)
1d: 0f 18 88 80 01 00 00 prefetcht0 0x180(%rax)
24: 0f 0d 88 00 02 00 00 prefetchw 0x200(%rax)
2b: 31 c0 xor %eax,%eax
2d: 48 81 c4 b0 02 00 00 add $0x2b0,%rsp
34: c3 retq
If you play with the third argument (the locality hint) you'll notice that the hint levels are ignored for prefetchW, since it doesn't support temporal-level hints. By the way, if you remove the -mprfchw flag, gcc will convert this into a normal read prefetch (I haven't tried different -march settings; some of them may enable it as well).
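One way to check whether a given set of flags enables the instruction, without reading the disassembly, is to test the __PRFCHW__ macro, which GCC defines when -mprfchw is in effect (either explicitly or via -march). The helper below is just a sketch with a made-up name; it reports what the compiler was allowed to emit, not what the CPU running the program supports.

```c
/* Returns 1 if this file was compiled with PREFETCHW enabled, else 0. */
int prefetchw_enabled_at_compile_time(void)
{
#ifdef __PRFCHW__
    return 1;
#else
    return 0;
#endif
}
```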