Tags: performance, gcc, x86-64, rdtsc

rdtsc timing for measuring a function


I want to time a function call with rdtsc. So I measured it in two ways as follows.

  1. Call it in a loop, aggregate each rdtsc difference within the loop, and divide by the number of calls (let's say this is N).
  2. Call it in a loop, take the rdtsc difference across the whole loop, and divide by N.

But I see a couple of inconsistent behaviors.

  1. When I increase N, the times get reduced rather monotonically with both method 1 and method 2. For method 2 this is understandable, since it amortizes the loop control overhead, but I am not sure why this happens for method 1 as well.
  2. Actually, for method 2, each time I increase N the value I got for N=1 seems to simply be divided by the new N. Inspecting the gdb disassembly made me realize this was a compiler optimization at -O2 where the loop is skipped in the second case. So I retried with -O0, where the gdb disassembly shows the actual loop being present for the second case as well (see the sketch right after this list).
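
Aside (my addition, not part of the original code): a common way to keep such a loop from being removed at -O2 is an empty asm statement with a "memory" clobber inside the loop body, for example in the second loop:

    for (i = 0; i < N; i++) {
      bar();
      /* Empty asm with a "memory" clobber: the compiler must assume it has side
         effects, so the loop itself cannot be dropped at -O2. Note that the call to
         an empty bar() defined in the same file may still be optimized out unless
         its definition is hidden from the optimizer (e.g. compiled separately). */
      asm volatile("" ::: "memory");
    }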

The full code is given below.

    #include <stdio.h>
    #include <inttypes.h>
    #include <stdlib.h>

    typedef unsigned long long ticks;

    static __inline__ ticks getticks(void) {
      unsigned a, d; 
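      /* rdtsc returns the 64-bit TSC in EDX:EAX; "=a" binds EAX (low half) to a
         and "=d" binds EDX (high half) to d */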
      asm volatile("rdtsc" : "=a" (a), "=d" (d)); 
      return ((ticks)a) | (((ticks)d) << 32); 
    }

    __attribute__ ((noinline))
    void bar() {
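      /* intentionally empty: this is the measurement target; noinline keeps the
         call site, though the optimizer can still see the body is empty */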

    }

    int main(int argc, char** argv) {

       long long N = 1000000; 
       N = atoi(argv[1]);
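       /* atoi() returns int, which cannot represent values above INT_MAX;
          this is why the N=10000000000 run below reports 1410065408 invocations */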
       int i;
       long long bar_total = 0;

       ticks start = 0, end = 0;

       for (i = 0; i < N; i++) {
         start = getticks();
         bar();
         end = getticks();
         bar_total += (end - start);
       } 

       fprintf(stdout, "Total invocations : %lld\n", N);
       fprintf(stdout, "[regular] bar overhead : %lf\n", ((double)bar_total/  N));

      start = getticks();
      for (i = 0; i < N; i++) {
        bar();
      } 
      end = getticks();

      bar_total = (end - start);

      fprintf(stdout, "[Loop] bar overhead : %lf\n", ((double)bar_total/ N));

      return 0;

     }

Any idea what's going on here? I can post the gdb disassembly as well if needed. I used the rdtsc implementation from http://dasher.wustl.edu/tinker/distribution/fftw/kernel/cycle.h

Edit: I have to retract my second statement that at -O0 the time drops in direct proportion to N in the second case. I guess it was some mistake I made during the build causing an older binary to persist. Anyhow, it still goes down somewhat, along with the figure for method 1. Here are some numbers for different N values.

    taskset -c 2 ./example.exe 1
    Total invocations : 1
    [regular] bar overhead : 108.000000
    [Loop] bar overhead : 138.000000

    taskset -c 2 ./example.exe 10
    Total invocations : 10
    [regular] bar overhead : 52.900000
    [Loop] bar overhead : 40.700000

    taskset -c 2 ./example.exe 100
    Total invocations : 100
    [regular] bar overhead : 46.780000
    [Loop] bar overhead : 15.570000

    taskset -c 2 ./example.exe 1000
    Total invocations : 1000
    [regular] bar overhead : 46.069000
    [Loop] bar overhead : 13.669000

    taskset -c 2 ./example.exe 100000
    Total invocations : 10000
    [regular] bar overhead : 46.010100
    [Loop] bar overhead : 13.444900

    taskset -c 2 ./example.exe 100000000
    Total invocations : 100000000
    [regular] bar overhead : 26.970272
    [Loop] bar overhead : 5.201252

    taskset -c 2 ./example.exe 1000000000
    Total invocations : 1000000000
    [regular] bar overhead : 18.853279
    [Loop] bar overhead : 5.218234

    taskset -c 2 ./example.exe 10000000000
    Total invocations : 1410065408
    [regular] bar overhead : 18.540719
    [Loop] bar overhead : 5.216395

I see two new behaviors now.

  1. Method 1 converges more slowly than method 2. But I am still puzzled over why there is such a drastic difference in values for different N settings. Perhaps I am making some basic mistake here which I don't see at the moment.
  2. The method 1 value is actually larger than the method 2 value by some margin. I expected it to be on par with, or slightly smaller than, the method 2 value, since it doesn't contain the loop control overhead.

Questions

So, in summary, my questions are:

  1. Why do the values given by both methods change so drastically when increasing N? Especially for method 1, which doesn't account for the loop control overhead.

  2. Why is the second method's result less than the first method's, when the first method excludes the loop control overhead from the calculation?

Edit 2

Regarding the suggested rdtscp solution.

Being uninitiated in inline assembly, I did the following.

    static __inline__ ticks getstart(void) {
      unsigned cycles_high = 0, cycles_low = 0;
      asm volatile ("CPUID\n\t"
                    "RDTSC\n\t"
                    "mov %%edx, %0\n\t"
                    "mov %%eax, %1\n\t" : "=r" (cycles_high), "=r" (cycles_low) ::
                    "%rax", "%rbx", "%rcx", "%rdx");
      return ((ticks)cycles_high) | (((ticks)cycles_low) << 32);
    }

    static __inline__ ticks getend(void) {
      unsigned cycles_high = 0, cycles_low = 0;
      asm volatile ("RDTSCP\n\t"
                    "mov %%edx, %0\n\t"
                    "mov %%eax, %1\n\t"
                    "CPUID\n\t" : "=r" (cycles_high), "=r" (cycles_low) ::
                    "%rax", "%rbx", "%rcx", "%rdx");
      return ((ticks)cycles_high) | (((ticks)cycles_low) << 32);
    }

and used the above methods before and after the function call. But now I get nonsensical results like the following.

    Total invocations : 1000000
    [regular] bar overhead : 304743228324.708374
    [Loop] bar overhead : 33145641307.734016

What's the catch? I wanted to factor these out as inlined functions since I use them in multiple places.
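
For comparison (my observation, not the comment that resolved this): the Intel snippet quoted in the solution below combines the halves as ((uint64_t)cycles_high << 32) | cycles_low, i.e. the EDX half is the one shifted left by 32, whereas the return statements above shift cycles_low instead. A sketch of the start helper using the Intel-style combination would look like this; getend() would be analogous:

    static __inline__ ticks getstart(void) {
      unsigned cycles_high = 0, cycles_low = 0;
      asm volatile ("CPUID\n\t"
                    "RDTSC\n\t"
                    "mov %%edx, %0\n\t"
                    "mov %%eax, %1\n\t" : "=r" (cycles_high), "=r" (cycles_low) ::
                    "%rax", "%rbx", "%rcx", "%rdx");
      /* cycles_high came from EDX (the high 32 bits), so it is the half that gets shifted */
      return (((ticks)cycles_high) << 32) | cycles_low;
    }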

A. Solution in the comments.


Solution

  • You use the plain rdtsc instruction, which may not work correctly on out-of-order CPUs, like Xeons and Cores. You should add a serializing instruction or switch to the rdtscp instruction:

    http://en.wikipedia.org/wiki/Time_Stamp_Counter

    Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.[3] This problem can be solved by executing a serializing instruction, such as CPUID, to force every preceding instruction to complete before allowing the program to continue, or by using the RDTSCP instruction, which is a serializing variant of the RDTSC instruction.

    Intel has a recent manual on using rdtsc/rdtscp - How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures (ia-32-ia-64-benchmark-code-execution-paper.pdf, 324264-001, 2010). They recommend cpuid+rdtsc for the start timer and rdtscp for the end timer:

    The solution to the problem presented in Section 0 is to add a CPUID instruction just after the RDTSCP and the two mov instructions (to store in memory the value of edx and eax). The implementation is as follows:

    asm volatile ("CPUID\n\t"
     "RDTSC\n\t"
     "mov %%edx, %0\n\t"
     "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
    "%rax", "%rbx", "%rcx", "%rdx");
    /***********************************/
    /*call the function to measure here*/
    /***********************************/
    asm volatile("RDTSCP\n\t"
     "mov %%edx, %0\n\t"
     "mov %%eax, %1\n\t"
     "CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1)::
    "%rax", "%rbx", "%rcx", "%rdx");
    
    start = ( ((uint64_t)cycles_high << 32) | cycles_low );
    end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 );
    

    In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. Nevertheless, this call does not affect the measurement since it comes before the RDTSC (i.e., before the timestamp register is read). The first RDTSC then reads the timestamp register and the value is stored in memory. Then the code that we want to measure is executed. If the code is a call to a function, it is recommended to declare such function as “inline” so that from an assembly perspective there is no overhead in calling the function itself. The RDTSCP instruction reads the timestamp register for the second time and guarantees that the execution of all the code we wanted to measure is completed.

    Your example is not very correct; you try to measure the empty function bar(), but it is so short that in method 1 (for() { rdtsc; bar(); rdtsc; }) you are mostly measuring the overhead of rdtsc itself. According to Agner Fog's tables for Haswell - http://www.agner.org/optimize/instruction_tables.pdf page 191 (the long table "Intel Haswell List of instruction timings and μop breakdown", at the very end of it) RDTSC is 15 uops (no fusion possible) with a latency of 24 ticks; for the older Sandy Bridge microarchitecture, RDTSCP is 23 uops with 36 ticks of latency, versus 21 uops and 28 ticks for RDTSC. So you can't use plain rdtsc (or rdtscp) to directly measure such short code.
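
    One way to see how large that timer overhead is on a given machine is to take back-to-back readings with nothing in between and look at the minimum delta. This is only a rough sketch reusing the getticks() from the question, not something from the original answer:

        #include <stdio.h>

        typedef unsigned long long ticks;

        static __inline__ ticks getticks(void) {
          unsigned a, d;
          asm volatile("rdtsc" : "=a" (a), "=d" (d));
          return ((ticks)a) | (((ticks)d) << 32);
        }

        int main(void) {
          ticks min_delta = (ticks)-1;
          int i;
          for (i = 0; i < 1000000; i++) {
            ticks t0 = getticks();
            ticks t1 = getticks();    /* nothing between the two reads */
            if (t1 - t0 < min_delta)
              min_delta = t1 - t0;    /* the minimum delta approximates the timer overhead */
          }
          printf("approximate rdtsc measurement overhead: %llu ticks\n", min_delta);
          return 0;
        }

    Any measurement of a function whose runtime is comparable to this overhead will mostly reflect the timer, not the function.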