
Apache Commons Math 2.2 Percentile bug?


I am not 100% sure whether this is a bug or I am doing something wrong, but if you give Percentile a large amount of data that consists entirely of the same value (see code below), the evaluate method takes a very long time. If you give Percentile random values instead, evaluate takes considerably less time.

As noted below, Median is a subclass of Percentile.

Percentile java doc
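
(Assuming the javadoc is right that Median is just Percentile with the quantile set to 50, the same call can be made through Percentile directly, so the behavior below should really be about Percentile itself. A quick sketch of that equivalent call:)

import java.util.Arrays;

import org.apache.commons.math.stat.descriptive.rank.Median;
import org.apache.commons.math.stat.descriptive.rank.Percentile;

// Sketch of the relationship assumed above: Median should behave like
// Percentile with a quantile of 50, so either one should reproduce this.
public class MedianVsPercentile {
  public static void main(String[] args) {
    double[] data = new double[1000];
    Arrays.fill(data, 100.0);

    System.out.println(new Median().evaluate(data));          // 100.0
    System.out.println(new Percentile(50.0).evaluate(data));  // 100.0
  }
}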

private void testOne(){
  int size = 200000;
  int sameValue = 100;
  List<Double> list = new ArrayList<Double>();

  for (int i = 0; i < size; i++)
  {
    list.add((double)sameValue);
  }
  Median m = new Median();
  m.setData(ArrayUtils.toPrimitive(list.toArray(new Double[0])));

  long start = System.currentTimeMillis();
  System.out.println("Start:"+ start);

  double result = m.evaluate();

  System.out.println("Result:" + result);
  System.out.println("Time:"+ (System.currentTimeMillis()- start));
}


private void testTwo(){
  int size = 200000;
  List<Double> list = new ArrayList<Double>();

  Random r = new Random();

  for (int i = 0; i < size; i++)
  {
    list.add(r.nextDouble() * 100.0);
  }
  Median m = new Median();
  m.setData(ArrayUtils.toPrimitive(list.toArray(new Double[0])));

  long start = System.currentTimeMillis();
  System.out.println("Start:"+ start);

  double result = m.evaluate();

  System.out.println("Result:" + result);
  System.out.println("Time:"+ (System.currentTimeMillis()- start));
}

Solution

  • This is a known issue that was introduced between versions 2.0 and 2.1, and it has been fixed in version 3.1.

    Version 2.0 did indeed sort the data, but in 2.1 the implementation switched to a selection algorithm. However, a bug in that implementation led to very poor behavior on data with many identical values: essentially, the partitioning comparisons used >= and <= instead of > and <.
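
    As a minimal sketch of that effect (this is hand-rolled illustration code, not the actual Commons Math source; the class and method names are invented), the quickselect below can partition either with strict comparisons, which stop the scanning pointers on values equal to the pivot, or with lenient >=/<= comparisons, which step over them. On an array of identical values the strict version splits the range roughly in half each round, while the lenient version sweeps the whole range and only peels off one element per round, which is the quadratic slowdown testOne runs into:

    import java.util.Arrays;

    // Illustration only: why >=/<= in a Hoare-style selection partition is
    // disastrous for arrays of identical values, while >/< stays balanced.
    public class PartitionComparisonSketch {

      // strict = true  -> scans stop on elements EQUAL to the pivot (>, <),
      //                   so equal values end up spread over both halves.
      // strict = false -> scans step over equal elements (>=, <=); on an
      //                   all-equal array one pointer sweeps the whole range,
      //                   so each round removes a single element: O(n^2).
      //                   (Only meant to be run on the all-equal demo input.)
      static int partition(double[] a, int lo, int hi, boolean strict) {
        double pivot = a[lo];
        int i = lo - 1;
        int j = hi + 1;
        while (true) {
          if (strict) {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
          } else {
            do { i++; } while (i < hi && a[i] <= pivot);
            do { j--; } while (j > lo && a[j] >= pivot);
          }
          if (i >= j) {
            return j;
          }
          double tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        }
      }

      // Iterative quickselect for the k-th smallest element.
      static double select(double[] a, int k, boolean strict) {
        int lo = 0;
        int hi = a.length - 1;
        while (lo < hi) {
          int p = partition(a, lo, hi, strict);
          if (k <= p) {
            hi = p;
          } else {
            lo = p + 1;
          }
        }
        return a[k];
      }

      public static void main(String[] args) {
        // Kept well below the 200000 in the question: the lenient variant
        // scales quadratically with the array size.
        int size = 20000;
        double[] same = new double[size];
        Arrays.fill(same, 100.0);

        long start = System.currentTimeMillis();
        select(same.clone(), size / 2, true);
        System.out.println("strict  (>, <)  : " + (System.currentTimeMillis() - start) + " ms");

        start = System.currentTimeMillis();
        select(same.clone(), size / 2, false);
        System.out.println("lenient (>=, <=): " + (System.currentTimeMillis() - start) + " ms");
      }
    }

    On the 200,000 identical values in testOne the lenient behavior works out to something on the order of 10^10 pointer steps, which is why evaluate appears to hang, while the random data in testTwo rarely produces such degenerate partitions.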