Tags: .net, compiler-construction, virtual, intermediate-language, sealed

How is this virtual method call faster than the sealed method call?


I am doing some tinkering on the performance of virtual vs sealed members.

Below is my test code.

The output is

virtual total 3166ms
per call virtual 3.166ns
sealed total 3931ms
per call sealed 3.931ns

I must be doing something wrong, because according to these results the virtual call is faster than the sealed call.

I am running in Release mode with "Optimize code" turned on.

Edit: when running outside of VS (as a console app), the times are close to a dead heat, but the virtual call almost always comes out in front.

using System;
using System.Diagnostics;
using NUnit.Framework;

[TestFixture]
public class VirtTests
{

    public class ClassWithNonEmptyMethods
    {
        private double x;
        private double y;

        public virtual void VirtualMethod()
        {
            x++;
        }
        public void SealedMethod()
        {
            y++;
        }
    }

    const int iterations = 1000000000;


    [Test]
    public void NonEmptyMethodTest()
    {

        var foo = new ClassWithNonEmptyMethods();
        //Pre-call
        foo.VirtualMethod();
        foo.SealedMethod();

        var virtualWatch = new Stopwatch();
        virtualWatch.Start();
        for (var i = 0; i < iterations; i++)
        {
            foo.VirtualMethod();
        }
        virtualWatch.Stop();
        Console.WriteLine("virtual total {0}ms", virtualWatch.ElapsedMilliseconds);
        Console.WriteLine("per call virtual {0}ns", ((float)virtualWatch.ElapsedMilliseconds * 1000000) / iterations);


        var sealedWatch = new Stopwatch();
        sealedWatch.Start();
        for (var i = 0; i < iterations; i++)
        {
            foo.SealedMethod();
        }
        sealedWatch.Stop();
        Console.WriteLine("sealed total {0}ms", sealedWatch.ElapsedMilliseconds);
        Console.WriteLine("per call sealed {0}ns", ((float)sealedWatch.ElapsedMilliseconds * 1000000) / iterations);

    }

}

Solution

  • You are testing the effects of memory alignment on code efficiency. The 32-bit JIT compiler has trouble generating efficient code for value types that are more than 32 bits in size, such as long and double in C#. The root of the problem is the 32-bit GC heap allocator: it only promises alignment of allocated memory on addresses that are a multiple of 4. That's the issue here: you are incrementing doubles, and a double is efficient only when it is aligned on an address that's a multiple of 8. The stack has the same problem: local variables are also aligned only to 4 on a 32-bit machine.

    The L1 CPU cache is internally organized in blocks called "cache lines". There is a penalty when the program reads a misaligned double, especially one that straddles the end of a cache line: bytes from two cache lines have to be read and glued together. Misalignment isn't uncommon with the 32-bit jitter; it is merely 50-50 odds that the 'x' field happens to be allocated on an address that's a multiple of 8. If it isn't, then 'x' and 'y' are misaligned and one of them may well straddle a cache line. The way you wrote the test, that is going to make either VirtualMethod or SealedMethod slower. Make sure you let them use the same field to get comparable results.
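A minimal sketch of that suggestion (not the answerer's code; the class and method names are made up): both methods increment the same field, so whatever alignment 'x' happens to land on, the virtual and non-virtual loops pay exactly the same price.

```csharp
using System;
using System.Diagnostics;

class SharedFieldBench
{
    public class Foo
    {
        private double x;

        // Both methods touch the same field, so an alignment
        // penalty on 'x' hits both timing loops equally.
        public virtual void VirtualMethod() { x++; }
        public void NonVirtualMethod() { x++; }
    }

    static void Main()
    {
        const int iterations = 100000000;
        var foo = new Foo();
        foo.VirtualMethod();        // JIT both methods before timing
        foo.NonVirtualMethod();

        var watch = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++) foo.VirtualMethod();
        watch.Stop();
        Console.WriteLine("virtual total {0}ms", watch.ElapsedMilliseconds);

        watch.Restart();
        for (var i = 0; i < iterations; i++) foo.NonVirtualMethod();
        watch.Stop();
        Console.WriteLine("non-virtual total {0}ms", watch.ElapsedMilliseconds);
    }
}
```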

    The same is true for code. Swap the code for the virtual and sealed tests to arbitrarily change the outcome; I had no trouble making the sealed test quite a bit faster that way. Given the modest difference in speed, you are probably looking at a code-alignment issue. The x64 jitter makes an effort to insert NOPs to get branch targets aligned; the x86 jitter doesn't.

    You should also run the timing test several times in a loop, at least 20. You are then likely to also observe the effect of the garbage collector moving the class object: the double may have a different alignment afterward, dramatically changing the timing. Accessing a 64-bit value type such as long or double has three distinct timings: aligned on 8, aligned on 4 within a cache line, and aligned on 4 across two cache lines, in fast-to-slow order.
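A sketch of the "run it at least 20 times" advice, under the assumption that forcing a collection between runs gives the GC a chance to relocate the object; if the double's alignment changes, the per-run timings will visibly jump. The class and method names here are invented for illustration.

```csharp
using System;
using System.Diagnostics;

class RepeatedBench
{
    class Foo
    {
        private double x;
        public void Bump() { x++; }
    }

    static void Main()
    {
        const int iterations = 100000000;
        var foo = new Foo();
        foo.Bump();                 // JIT before timing

        for (var run = 0; run < 20; run++)
        {
            // A compacting collection may move 'foo', changing
            // the alignment of its double field between runs.
            GC.Collect();

            var watch = Stopwatch.StartNew();
            for (var i = 0; i < iterations; i++) foo.Bump();
            watch.Stop();
            Console.WriteLine("run {0}: {1}ms", run, watch.ElapsedMilliseconds);
        }
    }
}
```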

    The penalty is steep: reading a double that straddles a cache line is roughly three times slower than reading an aligned one. This is also the core reason why a double[] (array of doubles) is allocated in the Large Object Heap even when it has only 1000 elements, well south of the normal threshold of 85,000 bytes: the LOH has an alignment guarantee of 8. These alignment problems disappear entirely in code generated by the x64 jitter; both the stack and the GC heap have an alignment of 8.
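You can probe the double[] special case with GC.GetGeneration, which reports generation 2 for objects allocated directly in the Large Object Heap. Note the caveat: this behavior is specific to the 32-bit .NET Framework, where an array of 1000 doubles (8000 bytes) goes straight to the LOH for its 8-byte alignment guarantee; on 64-bit runtimes the ordinary heap is already 8-aligned, so the same array starts out in generation 0. The printed values therefore depend on the runtime and bitness.

```csharp
using System;

class LohCheck
{
    static void Main()
    {
        // On the 32-bit .NET Framework the second line prints 2 (LOH);
        // on 64-bit runtimes both arrays start in generation 0.
        Console.WriteLine(GC.GetGeneration(new double[999]));
        Console.WriteLine(GC.GetGeneration(new double[1000]));
    }
}
```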