How to ensure that RDTSC is accurate?

I've read that RDTSC can give false readings and should not be relied upon.
Is this true and if so what can be done about it?

Solution

Very old CPUs have an RDTSC that is accurate: the counter ticks once per CPU clock cycle.

The problem

However, newer CPUs have a problem.
Engineers decided that RDTSC would be great for telling time.
However, if a CPU throttles its frequency, a counter that ticks once per cycle is useless for telling time.
The aforementioned braindead engineers then decided to 'fix' this problem by having the TSC always run at the same frequency, even if the CPU slows down.

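This has the 'advantage' that the TSC can be used for telling elapsed (wall clock) time. However, it makes the TSC ~~useless~~ less useful for profiling, because it no longer counts the cycles the core actually executed. For example, if the core is throttled to half its nominal frequency, a routine that takes 1000 core cycles shows up as roughly 2000 TSC ticks.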

How to tell if your CPU is not broken

You can tell if your CPU is fine by reading the TSC_invariant bit in the CPUID.

Set EAX to 80000007H, execute CPUID, and read bit 8 of EDX.
If it is 0, then your CPU is fine.
If it is 1, then your CPU is broken and you need to make sure you profile whilst running the CPU at full throttle.

function IsTimerBroken: boolean;
{$ifdef CPUX86}
asm
  //Make sure RDTSC measures CPU cycles, not wall clock time.
  push ebx
  mov eax,$80000007  //Has TSC Invariant support?
  cpuid
  pop ebx
  xor eax,eax        //Assume no
  and edx,$100       //test TSC_invariant bit (bit 8 of EDX)
  setnz al           //if set, return true, your PC is broken.
end;
{$endif}
  //Make sure RDTSC measures CPU cycles, not wall clock time.
{$ifdef CPUX64}
asm
  mov r8,rbx
  mov eax,$80000007  //TSC Invariant support?
  cpuid
  mov rbx,r8
  xor eax,eax
  and edx,$100 //test bit 8
  setnz al
end;
{$endif}

How to fix out of order execution issues

See: http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

Use the following code:

function RDTSC: int64;
{$IFDEF CPUX64}
asm
  {$IFDEF AllowOutOfOrder}
  rdtsc
  {$ELSE}
  rdtscp        // RDTSCP waits until all earlier instructions have executed
  push rbx      // CPUID then serializes, to stop OoO execution from sneaking
  push rax      // later instructions in before the RDTSCP runs.
  push rdx      // See: http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
  xor eax,eax
  cpuid
  pop rdx
  pop rax
  pop rbx
  {$ENDIF}
  shl rdx,32
  or rax,rdx
  {$ELSE}
{$IFDEF CPUX86}
asm
  {$IFNDEF AllowOutOfOrder}
  xor eax,eax
  push ebx
  cpuid         // On x86 we can't assume the existence of RDTSCP
  pop ebx       // so use CPUID to serialize
  {$ENDIF}
  rdtsc
  {$ELSE}
{$MESSAGE FATAL 'RDTSC: unsupported CPU, only x86 and x64 are handled'}
{$ENDIF}
{$ENDIF}
end;

How to run RDTSC on a broken CPU

The trick is to force the CPU to run at 100% load, so that it stays at full speed.
This is usually done by running the sample code many, many times.
I usually use 1,000,000 iterations to start with.
I then time those 1 million runs 10 times and take the lowest time of those attempts.

Comparisons with theoretical timings show that this gives very accurate results.
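
The procedure above could be sketched roughly as follows. This is only a minimal illustration, not a finished profiler: `MinBatchTicks`, `TTickSource`, `TCodeUnderTest`, `StubTicks` and `Work` are hypothetical names introduced here, and `StubTicks` is a dummy counter that exists only so the sketch compiles on its own. For real measurements you would pass the `RDTSC` function shown earlier instead.

```pascal
program MinOfRuns;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TTickSource = function: Int64;  // in real use: the RDTSC function from this answer
  TCodeUnderTest = procedure;     // the routine being profiled

var
  FakeClock: Int64 = 0;           // state for the stand-in tick source only
  Sink: Integer = 0;              // gives the sample routine something to do

// Stand-in tick source (assumption): replace with RDTSC for real measurements.
function StubTicks: Int64;
begin
  Inc(FakeClock);
  Result := FakeClock;
end;

// Hypothetical routine to profile.
procedure Work;
begin
  Inc(Sink);
end;

// Times Attempts batches of Iterations calls each
// and returns the lowest batch total that was seen.
function MinBatchTicks(Ticks: TTickSource; Code: TCodeUnderTest;
  Iterations, Attempts: Integer): Int64;
var
  a, i: Integer;
  t0, elapsed: Int64;
begin
  Result := High(Int64);
  for a := 1 to Attempts do
  begin
    t0 := Ticks();
    for i := 1 to Iterations do
      Code();
    elapsed := Ticks() - t0;
    if elapsed < Result then
      Result := elapsed;          // keep the least-disturbed attempt
  end;
end;

begin
  // 1 million iterations, 10 attempts, as described above.
  WriteLn(MinBatchTicks(StubTicks, Work, 1000000, 10));
end.
```

Dividing the result by `Iterations` gives an approximate per-call cost. Taking the minimum rather than the average is a deliberate choice: interrupts, task switches and the warm-up period while the CPU clocks up can only make an attempt slower, so the lowest attempt is the one least affected by them.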