Why does JMH report such strange times for a simple Quicksort --obviously disproportionate to N * log(N)?

Having an intent to study a sort algorithm (of my own), I decided to compare its performance with the classical quicksort and to my great surprise I've discovered that the time taken by my implementation of quicksort is far not proportional to N log(N). I thoroughly tried to find an error in my quicksort but unsuccessfully. It is a simple version of the sort algorithm working with arrays of Integer of different sizes, filled with random numbers, and I have no idea, where the error can sneak in. I have even counted all the comparisons and swaps executed by my code, and their number was rather fairly proportional to N log(N). I am completely confused and can't understand the reality I observe. Here are the benchkmark results for sorting arrays of 1,000, 2,000, 4,000, 8,000 and 16,000 random values (measured with JMH):

Benchmark                       Mode  Cnt       Score     Error  Units    2N / N ratio          
QSortBenchmarks.sortArray01000  avgt    5     561.505 ±   2.992  us/op              
QSortBenchmarks.sortArray02000  avgt    5    2433.307 ±  11.770  us/op    4.334 
QSortBenchmarks.sortArray04000  avgt    5    8510.448 ±  34.051  us/op    3.497 
QSortBenchmarks.sortArray08000  avgt    5   38269.492 ± 161.010  us/op    4.497 
QSortBenchmarks.sortArray16000  avgt    5  147132.524 ± 261.963  us/op    3.845

Obviously the time complexity I observe is very far from O(n log(n)), it's almost O(n^2). There may be a tiniest suspicion that the random seed has befallen so unlucky that the values in the arrays happened to be close to the worst case. The probability of this is very close to 0, yet not 0. But the results with a few different random seeds were very similar to that.

And here are the numbers of comparisons and swaps (for 40 iterations with randomly-filled arrays of each size):

                     Avr. Comparisons      Avr. Swaps                     2N / N ratio
sortArray01000():           12119.925,       2398.925             
sortArray02000():           26581.600,       5268.525                     2.193, 2.196
sortArray04000():           59866.925,      11451.625                     2.252, 2.174
sortArray08000():          127731.025,      25006.425                     2.134, 2.184
sortArray16000():          273409.925,      54481.525                     2.141, 2.179

As everyone can see, the numbers of operations conform the O(Nlog(N)) law rather well.

It is even possible to suspect that the cost of an individual operation depends on the size of the array we deal with, and I've checked if it's true (with a simple methods that reverse arrays of different sizes) -- and no, as expected, it's not so. The times were very close to O(n).

The only thing I can think of is there's some tricky logical error in my code, but I can't figure it out.

Can anybody help me?

Here's the code:

import java.io.IOException;
import java.util.Locale;
import java.util.Random;
import java.util.function.Consumer;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

/**
 * Why does quicksort take time disproportionate to N * log(N)?
 * Rectified for StackOverflow
 * 21.05.17 16:20:01
 */

@State(value = Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(java.util.concurrent.TimeUnit.MICROSECONDS)
@Fork(value = 2)
@Warmup(iterations = 5, time = 10)
@Measurement(iterations = 5, time = 10)
public class QSortBenchmarks {

  private static final int LOOPS_TO_ITERATE = 40; // 40;

  private static final int RAND_SEED = 123456789;
  private static final int RAND_LIMIT = 10000;
  private static final Random RAND = new Random(RAND_SEED); // Constant seed for reproducibility;

  private static int cmpCount = 0, swapCount = 0;
  private static int cmpTotal = 0, swapTotal = 0;

  private static final Integer[] array01000 = new Integer[1000];
  private static final Integer[] array02000 = new Integer[2000];
  private static final Integer[] array04000 = new Integer[4000];
  private static final Integer[] array08000 = new Integer[8000];
  private static final Integer[] array16000 = new Integer[16000];

  @Setup
  public static void initData() {
    cmpCount = 0; swapCount = 0;

    fillWithRandoms(array01000);
    fillWithRandoms(array02000);
    fillWithRandoms(array04000);
    fillWithRandoms(array08000);
    fillWithRandoms(array16000);
  }

  public static void main(String[] args) throws IOException, RunnerException {
    Locale.setDefault(Locale.US);
    initData();

    runJMH(); // Run benchmarks. Comment-out it, if you want just to count comparisons etc.
//    System.exit(0); // If don't want to count comparisons and swaps

    System.out.printf("\nRand seed = %d, rand limit = %d, iterations = %d\n",
                      RAND_SEED, RAND_LIMIT, LOOPS_TO_ITERATE);

    System.out.print("sortArray01000(): ");
    loopOverMethod(qq -> sortArray01000());

    System.out.print("sortArray02000(): ");
    loopOverMethod(qq -> sortArray02000());

    System.out.print("sortArray04000(): ");
    loopOverMethod(qq -> sortArray04000());

    System.out.print("sortArray08000(): ");
    loopOverMethod(qq -> sortArray08000());

    System.out.print("sortArray16000(): ");
    loopOverMethod(qq -> sortArray16000());
  }

  private static void loopOverMethod(Consumer<Object> method) {
    cmpTotal = 0; swapTotal = 0;
    for (int loops = 0; loops < LOOPS_TO_ITERATE; loops++ ) {
      initData();
      method.accept(null);;
      cmpTotal += cmpCount; swapTotal += swapCount;
    }
    System.out.printf("avrg compares: %12.3f,  swaps: %12.3f\n",
                      (double)cmpTotal / LOOPS_TO_ITERATE, (double)swapTotal / LOOPS_TO_ITERATE);
  }

  /**
   * @throws RunnerException
   */
  private static void runJMH() throws RunnerException {
    final Options opt = new OptionsBuilder()
        .include(QSortBenchmarks.class.getSimpleName())
        .forks(1)
        .build();
    new Runner(opt).run();
  }

  private static void fillWithRandoms(Integer[] array) {
    for (int i = 0; i < array.length; i++) { // // Fill it with LIST_SIZE random values
      array[i] = (int)(RAND.nextDouble() * RAND_LIMIT);
    }
  }

  @Benchmark
  public static void sortArray01000() {
    final Integer[] array = array01000;
    quickSort(array, 0, array.length - 1);
  }

  @Benchmark
  public static void sortArray02000() {
    final Integer[] array = array02000;
    quickSort(array, 0, array.length - 1);
  }

  @Benchmark
  public static void sortArray04000() {
    final Integer[] array = array04000;
    quickSort(array, 0, array.length - 1);
  }

  @Benchmark
  public static void sortArray08000() {
    final Integer[] array = array08000;
    quickSort(array, 0, array.length - 1);
  }

  @Benchmark
  public static void sortArray16000() {
    final Integer[] array = array16000;
    quickSort(array, 0, array.length - 1);
  }

  private static void quickSort(Integer[] array, int lo, int hi) {
    if (hi <= lo) return;
    final int j = partition(array, lo, hi);
    quickSort(array, lo, j - 1);
    quickSort(array, j + 1, hi);
  }

  private static int partition(Integer[] array, int lo, int hi) {
    int i = lo, j = hi + 1;
    while (true) {
      while (compare(array[++i], array[lo]) < 0) // while (array[++i] < array[lo])
        if (i == hi) break;
      while (compare(array[lo], array[--j]) < 0) // while (array[lo] < array[--j])
        if (j == lo) break;
      if (i >= j) break;
      swapItems(array, i, j);
    }
    swapItems(array, lo, j);
    return j;
  }

  private static int compare(Integer v1, Integer v2) {
    cmpCount++;
    return v1.compareTo(v2);
  }

  private static void swapItems(Integer[] array, int i, int j) {
    swapCount++;
    final Integer tmp = array[i];
    array[i] = array[j];
    array[j] = tmp;
  }

}

Update 21-05-18 10:08Z

The situation has cleared up. It was JMH that showed such strange results that had nothing in common with the actual execution times. A simple hand-made benchmarking with System.nanoTime(), for arrays of 10,000, 20,000, 40,000, 80,000, and 160,000 items, showed the following times, :

Rand seed = 123, rand limit = 1000000, iterations = 1000
sortArray_010k(): avrg time:    1049.538 mks/op
sortArray_020k(): avrg time:    2282.189 mks/op
sortArray_040k(): avrg time:    4976.079 mks/op
sortArray_080k(): avrg time:   10762.461 mks/op
sortArray_160k(): avrg time:   23115.345 mks/op

while the JMH results for the same arrays are as following:

Benchmark                       Mode  Cnt      Score     Error  Units
QSortBenchmarks.sortArray_010k  avgt    5    109.454 ±   0.444  ms/op
QSortBenchmarks.sortArray_020k  avgt    5    373.518 ±  17.439  ms/op
QSortBenchmarks.sortArray_040k  avgt    5   1350.420 ±  26.733  ms/op
QSortBenchmarks.sortArray_080k  avgt    5   6519.015 ±  48.770  ms/op
QSortBenchmarks.sortArray_160k  avgt    5  26837.697 ± 926.132  ms/op

The figures shown by the hand-made benchmark conform the N log(N) quite well, look realistic, and are in good agreement with the observed execution time in seconds. And they are from 100 to more than 1000 times less than the figures shown by JMH.

Here's the modified loopOverMethod() with which they are obtained:

  private static void loopOverMethod(Consumer<Object> method) {
    for (int loops = 0; loops < 100; loops++ ) { // Kinda warmup
      initData();
      method.accept(null);
    }

    long time = 0;
    cmpTotal = 0; swapTotal = 0;
    for (int loops = 0; loops < LOOPS_TO_ITERATE; loops++ ) {
      initData();
      time -= System.nanoTime();
      method.accept(null);
      time += System.nanoTime();
      cmpTotal += cmpCount; swapTotal += swapCount;
    }
    System.out.printf("avrg time: \t%10.3f mks\n", 
                      time * 1e-3 / LOOPS_TO_ITERATE);
  }

But now another question arises. Why does JMH shows such strange results? What is wrong in the way I use it? And obviously I have to revise other project where I use JMH. Rather an unpleasant news.

Update 21-05-19 10:08Z

Put it here as an update for the question, since comments don't allow to inser quotes. The comprehensive answer given by Thomas Kleger is absolutely correct. After following his suggestions, I got the following results:

JMH:

Benchmark                       Mode  Cnt      Score    Error  Units
QSortBenchmarks.sortArray_010k  avgt    5    975.028 ± 23.907  us/op
QSortBenchmarks.sortArray_020k  avgt    5   2253.627 ± 94.108  us/op
QSortBenchmarks.sortArray_040k  avgt    5   4836.680 ± 80.964  us/op
QSortBenchmarks.sortArray_080k  avgt    5  10041.063 ± 27.796  us/op
QSortBenchmarks.sortArray_160k  avgt    5  21232.223 ± 32.008  us/op

Hand-made benchmarks:

Rand seed = 123, rand limit = 1000000, iterations = 1000
sortArray_010k(): avrg time:    1060.951 mks
sortArray_020k(): avrg time:    2296.533 mks
sortArray_040k(): avrg time:    5021.629 mks
sortArray_080k(): avrg time:   10855.963 mks
sortArray_160k(): avrg time:   23335.923 mks

Now the results look quite plausible and I'm completely satisfied with them.

Solution

Three points work together against your implementation:

Quicksort has a worst case complexity of O(n^2)
Picking the leftmost element as pivot gives worst case behavior on already sorted arrays (https://en.wikipedia.org/wiki/Quicksort#Choice_of_pivot):

In the very early versions of quicksort, the leftmost element of the partition would often be chosen as the pivot element. Unfortunately, this causes worst-case behavior on already sorted arrays

Your algorithm sorts the arrays in place, meaning that after the first pass the "random" array is sorted. (To calculate average times JMH does several passes over the data).

To fix this, you could change your benchmark methods. For example, you could change sortArray01000() to

@Benchmark
public static void sortArray01000() {
  final Integer[] array = Arrays.copyOf(array01000, array01000.length);
  quickSort(array, 0, array.length - 1);
}

Or you could modify the @Setup annotation so that it is executed before each invocation of the benchmark method:

@Setup(Level.Invocation)
public static void initData() {
    //...
}

The @Setup annotation takes a parameter that determines when the setup method is executed.

The three levels are (https://hg.openjdk.java.net/code-tools/jmh/file/2be2df7dbaf8/jmh-core/src/main/java/org/openjdk/jmh/annotations/Level.java):

Level.Trial: before each benchmark
Level.Iteration: before each iteration
Level.Invocation: before each execution of the benchmark method

The default level is Level.Trial (https://hg.openjdk.java.net/code-tools/jmh/file/2be2df7dbaf8/jmh-core/src/main/java/org/openjdk/jmh/annotations/Setup.java#l54).

What does that mean for your test?

To understand this you have to understand how JMH executes your benchmark:

It starts a trial for one of your benchmark methods
during this trial it does 5 warm-up iterations and 5 measurement iterations
during each iteration it calls your benchmark method in a tight loop until 10 seconds have passed - if your benchmark method takes 500us that means it will be called around 20'000 times during each iteration, or about 200'000 times during full trial

Now with the @Setup(Level.Trial) and a benchmark method that sorts the input data in place that means that only the very first call of the quicksort method can show the O(N log(N)) behavior, all remaining calls operate on the already sorted array and show the worst case behavior of O(N^2).

With @Setup(Level.Iteration) the situation is still not much better - now it is the first call of the benchmark method in each iteration that has O(N log(N)) behavior, the remaining ~20'000 calls per iteration still show O(N^2).

With @Setup(Level.Invocation) finally each invocation of the benchmark method (and therefore each invocation of the quicksort) gets its own, unsorted array as an input, and this clearly shows in the results:

@Setup(Level.Trial)

1000: 780 us
2000: 3300 us


@Setup(Level.Iteration)

1000: 780 us
2000: 3280 us
4000: 11700 us


@Setup(Level.Invocation)

1000: 58 us
2000: 124 us
4000: 280 us

With my proposed change (copying the input array in the benchmark method) I get slightly better results, but these could be due to caching effects:

1000: 25 us
2000: 108 us
4000: 260 us