Tags: c#, .net, performance, bounds-check-elimination

Array bounds check efficiency in .net 4 and above


I'm interested in how efficient low-level algorithms can be in .NET. I would like us to be able to choose to write more of our code in C# rather than C++ in the future, but one stumbling block is the bounds checking that .NET performs on array loops and random access.

A motivating example is a function that calculates the sum of products of corresponding elements in two arrays (this is the dot product of two vectors).

static double SumProduct(double[] X, double[] Y)
{
    double sum = 0;
    int length = X.Length;
    if (length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    for (int i = 0; i < length; i++) // Check X.Length instead? See below
        sum += X[i] * Y[i];
    return sum;
}

From what I can tell (I don't know enough IL or x86 to check), the JIT compiler won't optimize out the bounds checking on X and Y. Am I wrong, and/or is there a way to write my code so that the compiler can help me out?

Further details

There are many efficiency arguments for and against using particular languages, not least that it is better to concentrate on "big O" algorithmic cost than on the constant of proportionality, and that higher-level languages help you do this. On the subject of bounds checking in .NET, the best article I found is Array Bounds Check Elimination in the CLR on MSDN (also referenced in a Stack Overflow answer on the importance of enabling optimization).

That article dates from 2009, so I wonder whether things have changed significantly since then. It also reveals some real subtleties that would have caught me out, so for that reason alone I would welcome some expert advice.

For example, it appears that in my code above I would have been better off writing i < X.Length rather than i < length. I had also naively assumed that, for an algorithm using a single array, a foreach loop would better declare my intent to the compiler and give it the best chance of optimizing out the bounds checking.
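To make that concrete, here is how I might rewrite SumProduct to follow the pattern the article describes. This is only a sketch of the loop shape; whether the check on Y[i] is also removed is exactly the part I'm unsure about.

static double SumProduct(double[] X, double[] Y)
{
    if (X.Length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    double sum = 0;
    // Testing X.Length directly in the loop condition (rather than a hoisted
    // local) is the pattern the JIT is said to recognize, so the check on
    // X[i] should be removable. The check on Y[i] is the open question.
    for (int i = 0; i < X.Length; i++)
        sum += X[i] * Y[i];
    return sum;
}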

According to the MSDN article, SumForBAD, below, which I thought was sure to be optimized, would not be. SumFor, by contrast, would be straightforwardly optimized, and SumForEach would also be optimized, but not trivially (and perhaps not at all if the array were passed into a function as IEnumerable<double>; see the sketch after these three functions).

static double SumForBAD(double[] X)
{
    double sum = 0;
    int length = X.Length; // hoisting the length is what defeats the optimization; better to test i < X.Length in the loop
    for (int i = 0; i < length; i++)
        sum += X[i];
    return sum;
}

static double SumFor(double[] X)
{
    double sum = 0;
    for (int i = 0; i < X.Length; i++)
        sum += X[i];
    return sum;
}

static double SumForEach(double[] X)
{
    double sum = 0;
    foreach (double element in X)
        sum += element;
    return sum;
}
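For comparison, here is roughly what the IEnumerable<double> case would look like (the name SumEnumerable is just mine, for illustration). Once the array arrives through the interface, foreach compiles down to GetEnumerator()/MoveNext()/Current calls rather than an indexed loop, so array bounds-check elimination does not even come into play.

static double SumEnumerable(IEnumerable<double> values) // needs using System.Collections.Generic
{
    double sum = 0;
    // No array indexing here at all: each element comes back through the
    // enumerator, so there is no bounds check for the JIT to eliminate.
    foreach (double element in values)
        sum += element;
    return sum;
}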

I did some investigation based on doug65536's answer. In C++, I compared the times of a SumProduct that does one bounds-check

for(int i=0; i<n; ++i) sum += v1[i]*v2[i];

against another version that does two bounds-checks

for(int i=0; i<n1 && i<n2; ++i) sum += v1[i]*v2[i];

I found that the second version was slower, but only by about 3.5% (Visual Studio 2010, optimized build, default options). However, it occurred to me that in C# there might be three bounds checks: one explicit (i < length in the SumProduct function at the start of this question) and two implicit (X[i] and Y[i]). So I tested a third C++ function, with three bounds checks

for(int i=0; i<n1 && i<n2 && i<n3; ++i) sum += v1[i]*v2[i];

This came in 35% slower than the first, which is worth caring about. I did some more investigation in this question: Why does adding extra check in loop make big difference on some machines, and small difference on others? Interestingly, the cost of bounds checking seems to vary significantly on different machines.


Solution

  • The bounds check won't matter because:

    • The bounds check consists of a cmp/jae instruction pair, which is fused into a single micro-op on modern CPU architectures (the term is "macro-op fusion"). Compare and branch is very highly optimized.

    • The bounds check is a forward branch, which will be statically predicted as not-taken, also reducing the cost. The branch will essentially never be taken (if it ever is, an exception is about to be thrown anyway, so the mispredict cost becomes utterly irrelevant).

    • As soon as there is any memory delay, speculative execution will queue up many iterations of the loop, so the cost of decoding the extra instruction pair almost disappears.

    Memory access will likely be your bottleneck, so the effect of micro-optimizations like removing bounds checks will disappear. If you want to see what the checks actually cost on your hardware, a rough timing harness is sketched below.
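A rough harness along these lines is enough to compare SumFor and SumForBAD from the question on your own machine (the linked follow-up question suggests the result varies a fair bit between machines). The array size and iteration count are arbitrary choices, and you would want a Release build run outside the debugger.

using System;
using System.Diagnostics;

static class BoundsCheckTiming
{
    static void Main()
    {
        double[] data = new double[10000];
        Random rng = new Random(42);
        for (int i = 0; i < data.Length; i++)
            data[i] = rng.NextDouble();

        // Warm-up calls so both methods are JIT-compiled before timing.
        SumFor(data);
        SumForBAD(data);

        const int iterations = 100000;
        double sink = 0;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            sink += SumFor(data);
        sw.Stop();
        Console.WriteLine("SumFor:    " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        for (int i = 0; i < iterations; i++)
            sink += SumForBAD(data);
        sw.Stop();
        Console.WriteLine("SumForBAD: " + sw.ElapsedMilliseconds + " ms");

        // Print the accumulated result so the calls are not optimized away.
        Console.WriteLine(sink);
    }

    static double SumFor(double[] X)
    {
        double sum = 0;
        for (int i = 0; i < X.Length; i++)
            sum += X[i];
        return sum;
    }

    static double SumForBAD(double[] X)
    {
        double sum = 0;
        int length = X.Length;
        for (int i = 0; i < length; i++)
            sum += X[i];
        return sum;
    }
}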