I expect the following program to be completely memory-bound in terms of performance (the arrays are way bigger than the L3 cache).
Therefore I expected the sum of the long array to take almost twice as long as the sum of the int array.
But they both take almost the same time:
int sum took 81 ms, result = 4999999950000000
long sum took 87 ms, result = 4999999950000000
Can anybody explain this?
using System;
using System.Diagnostics;
using System.Linq;

namespace MemoryPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            const int count = 100_000_000;
            // ~400 MB of ints and ~800 MB of longs -- both far larger than the L3 cache
            int[] intArray = Enumerable.Range(0, count).ToArray();
            long[] longArray = intArray.Select(x => (long)x).ToArray();

            Measure(() => intSum(intArray), " int sum");
            Measure(() => longSum(longArray), "long sum");
        }

        static long intSum(int[] array)
        {
            long sum = 0;
            for (int i = 0; i < array.Length; i++) sum += array[i];
            return sum;
        }

        static long longSum(long[] array)
        {
            long sum = 0;
            for (int i = 0; i < array.Length; i++) sum += array[i];
            return sum;
        }

        static void Measure(Func<long> calc, string description)
        {
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            long sum = calc();
            stopwatch.Stop();
            Console.WriteLine($"{description} took {stopwatch.ElapsedMilliseconds} ms, result = {sum}");
        }
    }
}
If I run this several times I get about the same picture, only worse: now the long version is consistently a little faster (printing of the result added just in case):
int sum took 84 ms (4999999950000000)
long sum took 77 ms (4999999950000000)
int sum took 84 ms (4999999950000000)
long sum took 76 ms (4999999950000000)
int sum took 84 ms (4999999950000000)
long sum took 77 ms (4999999950000000)
int sum took 83 ms (4999999950000000)
long sum took 76 ms (4999999950000000)
So with longs it's actually faster? One thing the long version is not doing that the int version is, is sign extension: the int sum has to widen every 32-bit element to 64 bits before adding it to the long accumulator. It could be that. Actually I don't know what else it could be.
But this is just what happens when all elements of the array are added. If I take only every 8th element (8th because cache lines are 64 bytes and longs are 8 bytes, so 8 longs fit in a cache line), this happens:
int sum took 25 ms (624999950000000)
long sum took 49 ms (624999950000000)
int sum took 23 ms (624999950000000)
long sum took 49 ms (624999950000000)
int sum took 23 ms (624999950000000)
long sum took 48 ms (624999950000000)
int sum took 23 ms (624999950000000)
long sum took 48 ms (624999950000000)
This is very different: now the int version is indeed about twice as fast as the long version, corresponding to the number of cache misses expected for each version. With a stride of 8 elements both versions still touch every cache line of their array, but the long array occupies twice as many cache lines, so the long version pulls in twice as much memory for the same number of loop iterations.
So I may conclude from this that in the "full" version, apparently enough arithmetic (or at least "stuff that isn't the memory access", including loop overhead) happens to hide most of the cache-miss penalty, combined with the long version actually doing slightly less work per element.
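As a rough sanity check on that, the timings above can be turned into effective bandwidth: 100 million ints are about 0.4 GB read in ~84 ms, roughly 5 GB/s, while 100 million longs are about 0.8 GB in ~77 ms, roughly 10 GB/s. So the int version is nowhere near the bandwidth the long version demonstrates is available, which fits the idea that something other than memory traffic is the limiter. A small tweak to the Measure helper makes this visible directly (just a sketch; the extra bytesTouched parameter is my addition, not part of the original program):

    static void Measure(Func<long> calc, string description, long bytesTouched)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        long sum = calc();
        stopwatch.Stop();

        // Effective bandwidth: bytes read by the loop divided by elapsed time.
        double gbPerSecond = bytesTouched / stopwatch.Elapsed.TotalSeconds / 1e9;
        Console.WriteLine($"{description} took {stopwatch.ElapsedMilliseconds} ms, " +
                          $"result = {sum}, ~{gbPerSecond:F1} GB/s");
    }

It would be called as Measure(() => intSum(intArray), " int sum", (long)count * sizeof(int)) and Measure(() => longSum(longArray), "long sum", (long)count * sizeof(long)).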
Also I think we should keep in mind that since this is a completely linear access pattern, hardware prefetching should be expected to do a good job. The throughput of automatic prefetching may or may not be adequate, but this is far from a bad case for it, and it is not unreasonable that doing a bit of computation per element gives the prefetcher time to "catch up".
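If one wanted to measure how much the linear pattern (and therefore the prefetcher) contributes, one way would be to visit the elements in a randomized order, so the per-iteration work stays the same but the accesses stop being sequential. A sketch of that idea (intSumShuffled and ShuffledIndices are hypothetical helpers I'm introducing here, not part of the original program):

    // Same summation, but through a pre-shuffled index array, so the data
    // accesses are no longer sequential and the hardware prefetcher has
    // little pattern to latch onto. (Reading the index array itself adds
    // some sequential traffic of its own, so it is not a perfect test.)
    static long intSumShuffled(int[] array, int[] indices)
    {
        long sum = 0;
        for (int i = 0; i < indices.Length; i++)
            sum += array[indices[i]];
        return sum;
    }

    static int[] ShuffledIndices(int count)
    {
        var rng = new Random(12345);                        // fixed seed for repeatability
        int[] indices = Enumerable.Range(0, count).ToArray();
        for (int i = count - 1; i > 0; i--)                 // Fisher-Yates shuffle
        {
            int j = rng.Next(i + 1);
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }
        return indices;
    }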
Relevant code that uses only every 8th element:
static long intSum(int[] array)
{
    long sum = 0;
    // stride of 8 ints = 32 bytes, so two elements of every 64-byte cache line are read
    for (int i = 0; i < array.Length; i += 8) sum += array[i];
    return sum;
}

static long longSum(long[] array)
{
    long sum = 0;
    // stride of 8 longs = 64 bytes, so exactly one element per cache line is read
    for (int i = 0; i < array.Length; i += 8) sum += array[i];
    return sum;
}
Timing on ideone