c assembly cpu-architecture measurement cpu-cache

Is movndq works?

My task is to calculate RAM Read/Write speed. I using asm inserts to avoid compiler optimizations. To measure time I use TSC and CPU frequency. To move data I use asm instruction MOVNTDQ which doesn't use cache hierarchy.

Problem is in results. Data rate (by datasheet) is 800 Mbps, and I got by my test > 2000 Mbps write speed.

void memory_notCache_write_128(void* src, long blocks_amount) 
{
    _asm    
    {
        mov ecx, blocks_amount
        mov     esi, src
    a20:
        movntdq [esi], xmm0
        movntdq [esi + 16], xmm1
        movntdq [esi + 32], xmm2
        movntdq [esi + 48], xmm3
        movntdq [esi + 64], xmm4
        movntdq [esi + 80], xmm5
        movntdq [esi + 96], xmm6
        movntdq [esi + 112], xmm7
        add esi,  128
        loop    a20;
    }
}

int main()
{ 
    unsigned __int64 tick1, tick2;
    const long nBytes = 32*KByte;   

    char* source = (char*)_mm_malloc(nBytes*sizeof(char),16);

    tick1 =  getTicks();
    memory_notCache_write_128(source, current_times.t128);
    tick2 =  getTicks();

    double time = (double)(tick2-tick1)/(ProcSpeedCalc());
    cout << "Time WRITE_128[seconds]:" << time << endl;
    cout << (double) nBytes / time / MByte << endl;

    return 0;
}

Datasheet of RAM, that I used - http://www.alldatasheet.com/datasheet-pdf/pdf/308537/ELPIDA/EBE11UE6ACUA-8G-E.html

Source code (was written for Win patform): https://bitbucket.org/closed_eyes/ram_speed_for_win/downloads/memory_test.cpp

Solution

You shouldn't use non-temporal operations for this sort of code. The real way to build a memory performance tester is to use the access pattern to make sure that you never hit in the cache. Generally, this is done by looping over a very large chunk of memory that is bigger than the last level of cache in your system where your stride is the same as the cache line size. If you do this, you'll ensure that every access will be a cache miss in all levels. Don't forget though that when you read just one byte from memory, the processor will fetch a whole cache line, so if you do a 64-bit load, on a machine with a 64-byte cache line (very common), you should be counting 64-bytes as being read from memory.