My task is to measure RAM read/write speed. I am using inline asm to keep the compiler from optimizing the accesses away. To measure time I use the TSC and the CPU frequency. To move data I use the MOVNTDQ instruction, which bypasses the cache hierarchy.
The problem is with the results. The data rate (according to the datasheet) is 800 Mbps, but my test reports a write speed of more than 2000 Mbps.
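void memory_notCache_write_128(void* src, long blocks_amount)
{
    _asm
    {
        mov ecx, blocks_amount      ; loop counter: number of 128-byte blocks
        mov esi, src                ; esi = address being written to
    a20:
        movntdq [esi], xmm0         ; eight non-temporal 16-byte stores = 128 bytes
        movntdq [esi + 16], xmm1
        movntdq [esi + 32], xmm2
        movntdq [esi + 48], xmm3
        movntdq [esi + 64], xmm4
        movntdq [esi + 80], xmm5
        movntdq [esi + 96], xmm6
        movntdq [esi + 112], xmm7
        add esi, 128                ; advance to the next block
        loop a20                    ; decrement ecx, repeat while ecx != 0
    }
}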
int main()
{
    unsigned __int64 tick1, tick2;
    const long nBytes = 32*KByte;
    char* source = (char*)_mm_malloc(nBytes*sizeof(char),16);
    tick1 = getTicks();
    memory_notCache_write_128(source, current_times.t128);
    tick2 = getTicks();
    double time = (double)(tick2-tick1)/(ProcSpeedCalc());
    cout << "Time WRITE_128[seconds]:" << time << endl;
    cout << (double) nBytes / time / MByte << endl;
    return 0;
}
Datasheet of the RAM I used: http://www.alldatasheet.com/datasheet-pdf/pdf/308537/ELPIDA/EBE11UE6ACUA-8G-E.html
Source code (written for the Windows platform): https://bitbucket.org/closed_eyes/ram_speed_for_win/downloads/memory_test.cpp
You shouldn't use non-temporal operations for this sort of code. The usual way to build a memory performance tester is to choose an access pattern that never hits in the cache. Generally, this is done by looping over a chunk of memory that is much larger than the last-level cache in your system, with a stride equal to the cache line size. If you do this, every access will be a cache miss at all levels. Don't forget, though, that when you read even one byte from memory, the processor fetches a whole cache line. So if you do a 64-bit load on a machine with a 64-byte cache line (very common), you should count 64 bytes as having been read from memory.
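The approach described above could be sketched roughly as follows. This is a minimal illustration, not the code from the question: the names `kCacheLine`, `ReadResult`, `strided_read` and `measure_read_mb_per_s` are made up for the example, the 64-byte line size is an assumption about your machine, and timing uses `std::chrono::steady_clock` rather than the TSC to keep the sketch portable.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache line size; 64 bytes is common on x86 but check your CPU.
constexpr std::size_t kCacheLine = 64;

struct ReadResult {
    std::size_t lines_touched;  // number of loads issued, one per cache line
    std::uint64_t checksum;     // sum of the loaded words
};

// Issues one 64-bit load per cache line across the whole buffer.
// The volatile pointer keeps the compiler from dropping or merging the loads.
ReadResult strided_read(const std::vector<std::uint64_t>& buf) {
    const volatile std::uint64_t* p = buf.data();
    const std::size_t step = kCacheLine / sizeof(std::uint64_t);
    ReadResult r{0, 0};
    for (std::size_t i = 0; i < buf.size(); i += step) {
        r.checksum += p[i];
        ++r.lines_touched;
    }
    return r;
}

// Times one pass over a buffer of buffer_bytes and returns MiB/s,
// counting a full cache line for every 8-byte load.
// buffer_bytes should be much larger than the last-level cache.
double measure_read_mb_per_s(std::size_t buffer_bytes) {
    // Filling the vector also makes the OS back every page with real memory.
    std::vector<std::uint64_t> buf(buffer_bytes / sizeof(std::uint64_t), 1);
    const auto t0 = std::chrono::steady_clock::now();
    const ReadResult r = strided_read(buf);
    const auto t1 = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double bytes = static_cast<double>(r.lines_touched) * kCacheLine;
    return bytes / seconds / (1024.0 * 1024.0);
}
```

Calling `measure_read_mb_per_s(256u << 20)` would time a single pass over 256 MiB, a size picked here only on the assumption that it is far larger than the last-level cache. Expect the hardware prefetcher to help with a sequential pattern like this, so the figure is closer to streaming bandwidth than to per-access latency.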