Effective main memory to CPU maximum bandwidth in C#

I want to write a C# program that performs basic operations on data read from main memory, so that I can get as close as possible to the main memory read bandwidth.

I guess we can be sure the cache is not used when working with very large arrays. So far, using multiple threads and long[], I've never been able to cross the 2 GB/s limit, while I know modern RAM bandwidth is more like 10 GB/s at least. (I have a modern computer and run in 64-bit release mode, without debugging of course.)

Can you provide a C# program capable of getting close to the maximum bandwidth? If not, could you explain why a C# program can't do it?

For example:

Prepare: create one (or several?) large arrays and fill them with random numbers
Main step: sum (or apply any other low-CPU operation to) all the elements in the array

Solution

Assuming you mean single-threaded bandwidth, that's fairly easy, for example like this:

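using System;
using System.Diagnostics;

// 320 million uints = 1.28 GB, far larger than any cache level.
uint[] data = new uint[10000000 * 32];
for (int j = 0; j < 15; j++)   // repeat the measurement several times
{
    uint sum = 0;
    var sw = Stopwatch.StartNew();
    // One read per 64-byte cache line (16 uints), unrolled by 4.
    for (uint i = 0; i < data.Length; i += 64)
    {
        sum += data[i] + data[i + 16] + data[i + 32] + data[i + 48];
    }
    sw.Stop();
    // Widen to long before multiplying so larger arrays cannot overflow int.
    long dataSize = (long)data.Length * sizeof(uint);
    // Printing sum keeps the read loop from being optimized away.
    Console.WriteLine("{0} {1:0.000} GB/s", sum, dataSize / sw.Elapsed.TotalSeconds / (1024 * 1024 * 1024));
}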

On my machine I get around 19.8-20.1 GB/s from this, and I know the single-threaded bandwidth is supposed to be around 20 GB/s so that seems fine. The multithreaded bandwidth on my machine is actually higher, around 30 GB/s, but that would take a more complex test that coordinates at least two threads.

Some tricks are necessary in this benchmark. Most importantly, I rely on a cache line size of 64 bytes to be able to skip doing anything with most of the data. Since the code touches every cache line (minus, perhaps, one or two at the start and end, because the array is not necessarily 64-byte aligned), the entire array is still transferred from memory. I also unrolled the loop by 4 and made the index variable unsigned to avoid pointless movsx instructions; I wasn't sure it would matter, but it did change the results a little, so I kept it. Keeping the instruction count low is important, especially with scalar code like this, so that the CPU work does not become the bottleneck instead of memory bandwidth.
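To make the cache line argument concrete, here is a small sketch that checks the access pattern rather than measuring anything. It assumes 64-byte cache lines and an array that starts on a line boundary (neither is guaranteed), and it expects a length that is a multiple of 64 elements. StrideCheck and its members are names made up for this illustration.

```csharp
using System;

static class StrideCheck
{
    // Assumption: 64-byte cache lines, as in the benchmark.
    public const int CacheLineBytes = 64;
    public const int ElementsPerLine = CacheLineBytes / sizeof(uint); // 16 uints

    // Number of cache-line-sized chunks in an array of the given length.
    public static int TotalChunks(int length) => length / ElementsPerLine;

    // Replays the benchmark's index pattern (i, i+16, i+32, i+48 with i += 64)
    // and counts how many distinct chunks it reads at least once.
    // Assumes length is a multiple of 64 and that element 0 is line-aligned.
    public static int ChunksTouched(int length)
    {
        var touched = new bool[TotalChunks(length)];
        int count = 0;
        for (int i = 0; i < length; i += 4 * ElementsPerLine)
        {
            for (int k = 0; k < 4; k++)
            {
                int chunk = (i + k * ElementsPerLine) / ElementsPerLine;
                if (!touched[chunk])
                {
                    touched[chunk] = true;
                    count++;
                }
            }
        }
        return count;
    }
}
```

For any length that is a multiple of 64, ChunksTouched(length) equals TotalChunks(length), so reading one uint out of every 16 is enough to pull the whole array through the memory hierarchy under those assumptions.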

However, this does not really benchmark the total memory bandwidth the system has available, which on my system cannot be reached from a single core. Certain microarchitectural details can limit the memory bandwidth available to a single core to less than the total memory bandwidth of the whole processor. You can read about various details in this answer by BeeOnRope.
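A multithreaded variant could look roughly like the sketch below. It is not the test I ran for the 30 GB/s figure, just one possible way to do it: each thread reads one uint per assumed 64-byte cache line from its own contiguous slice, and the partial sums are combined at the end. ParallelBandwidth, SumRange, ParallelSum and MeasureGiBPerSecond are hypothetical names, and the worker count is something you would have to tune for your machine.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class ParallelBandwidth
{
    // Reads one uint per 16 elements (one per assumed 64-byte line) in [start, end).
    public static uint SumRange(uint[] data, int start, int end)
    {
        uint sum = 0;
        for (int i = start; i < end; i += 16)
            sum += data[i];
        return sum;
    }

    // Gives each worker thread one contiguous slice and adds up the partial sums.
    public static uint ParallelSum(uint[] data, int workers)
    {
        var partial = new uint[workers];
        var threads = new Thread[workers];
        // Keep slice boundaries on multiples of 16 elements so no line is skipped.
        int slice = data.Length / workers / 16 * 16;
        for (int w = 0; w < workers; w++)
        {
            int id = w; // copy: the lambda must not capture the loop variable
            int start = id * slice;
            int end = (id == workers - 1) ? data.Length : start + slice;
            threads[id] = new Thread(() => partial[id] = SumRange(data, start, end));
            threads[id].Start();
        }
        uint total = 0;
        for (int w = 0; w < workers; w++)
        {
            threads[w].Join(); // wait for the worker before reading its result
            total += partial[w];
        }
        return total;
    }

    // Times one parallel pass; thread start-up cost is included in the result.
    public static double MeasureGiBPerSecond(uint[] data, int workers)
    {
        var sw = Stopwatch.StartNew();
        uint sum = ParallelSum(data, workers);
        sw.Stop();
        GC.KeepAlive(sum);
        long bytes = (long)data.Length * sizeof(uint);
        return bytes / sw.Elapsed.TotalSeconds / (1024.0 * 1024 * 1024);
    }
}
```

You would call something like ParallelBandwidth.MeasureGiBPerSecond(data, 4) several times on the same large array and look at the best runs. Because thread creation is inside the timed region, the array needs to be large enough for that overhead to be negligible.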