This question is about DRAM speeds and memory interleaving. I have a very specific problem. I am using a Power Architecture based board (without AltiVec) and I want to copy a large, virtually contiguous segment of memory between two regions within my process's address space. To offset the slowness of my core, I pinned two threads to two CPUs, and that made the copy a lot faster.
However, that was still not fast enough, so I added a third thread, and it made no difference to copy times whatsoever. I did more research and found that my board is equipped with a single DDR3 RAM module (speed 1600 MB/s), and that I was already pretty close to the maximum attainable speed.
[ Some explanation here: With just 2 threads, I am copying, say 5500 pages of size 4K in around 16.5 milliseconds. If you do a simple calculation, it would seem that the minimum time in theory that you could clock (bar all prefetches and stuff) is 13.75 milliseconds. ]
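The 13.75 ms figure can be reproduced with a few lines of C. This is only a sketch of the arithmetic: the function name `min_copy_ms` is made up for illustration, and it assumes, as the estimate above appears to, that 1 MB = 1000 KB and that 1600 MB/s is the rate available to the copy as a whole.

```c
#include <stdio.h>

/* Hypothetical helper: lower bound, in milliseconds, on the time needed to
 * move `pages` pages of `page_kb` KB each at `mb_per_s` MB/s.
 * Assumes decimal units (1 MB = 1000 KB), matching the estimate in the post. */
double min_copy_ms(double pages, double page_kb, double mb_per_s)
{
    double total_mb = pages * page_kb / 1000.0;  /* 5500 * 4 KB = 22 MB */
    return total_mb / mb_per_s * 1000.0;         /* seconds -> milliseconds */
}

/* Print the bound next to a measured time so the gap is easy to see. */
void report(double pages, double page_kb, double mb_per_s, double measured_ms)
{
    double bound = min_copy_ms(pages, page_kb, mb_per_s);
    printf("bound %.2f ms, measured %.2f ms, ratio %.2f\n",
           bound, measured_ms, bound / measured_ms);
}
```

Calling `report(5500, 4, 1600, 16.5)` gives a bound of 13.75 ms against the measured 16.5 ms, so the two-thread copy already runs at roughly 83% of that bound.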
I discovered that I could add a second RAM module to my board, which I could possibly get my company to fund by telling them I also intend to halve the size of each stick of memory. But how can I get the kernel to allocate me memory that is guaranteed to be evenly distributed across both modules?
Thanks a lot for answering!
P.S. I am using Linux kernel version 2.6.34.
See if your Linux / board combination supports the NUMA (Non-uniform memory access) extensions. You can specify interleaving policies through libnuma:
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
Available policies are page interleaving (i.e., allocate in a round-robin fashion from all, or a subset, of the nodes on the system), preferred node allocation (i.e., preferably allocate on a particular node), local allocation (i.e., allocate on the node on which the task is currently executing), or allocation only on specific nodes (i.e., allocate on some subset of the available nodes). It is also possible to bind tasks to specific nodes.