java, bigdata, colt

On repeatability of quantile estimates


I need to find arbitrary quantiles of a large stream of data (it doesn't fit in memory), and the results need to be repeatable, i.e. for the same stream the results should be identical. I have been using Colt for this, and the results are not repeatable.

Is there another library out there that meets these requirements?

What do I have to do to make the results of quantile binning repeatable with Colt (I'm using 1.2.0)? I've seeded my own random number generator, but it looks like Colt introduces its own randomness. I can't figure out where it comes from.

I get the following results for two different runs. If they were repeatable, the results would be the same:

[0.0990242124295947, 0.20014652659912247, 0.2996443961549412]
[0.09994965676310263, 0.20079195488768953, 0.29986981667267676]

Here is the code that generates it:

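import java.util.Arrays;
import java.util.Random;

import cern.colt.list.DoubleArrayList;
import hep.aida.bin.QuantileBin1D;

public class QuantileTest {

    public static void main(String[] args) {
        // The last argument (the RandomEngine) is null, so Colt picks its own default generator.
        QuantileBin1D qBins = new QuantileBin1D(false, Long.MAX_VALUE, 0.001, 0.0001, 64, null);
        // The input stream itself is seeded, so the data is identical on every run.
        Random rand = new Random(0);
        for (int i = 0; i < 1500000; i++) {
            double num = rand.nextDouble();
            qBins.add(num);
        }

        DoubleArrayList qMarks = new DoubleArrayList(new double[] {0.1, 0.2, 0.3});
        double[] xMarks = qBins.quantiles(qMarks).elements();
        System.out.println(Arrays.toString(xMarks));
    }
}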

Solution

  • There is still some randomness because you do not supply a RandomEngine to the QuantileBin1D (you pass null). In some classes (RandomSampler was the first occurrence I found) a default RandomEngine is then created, and it does not appear to be seeded repeatably:

    if (randomGenerator==null) randomGenerator = cern.jet.random.AbstractDistribution.makeDefaultGenerator();
    this.my_RandomGenerator=randomGenerator;
    

    You should change the constructor to new QuantileBin1D(false, Long.MAX_VALUE, 0.001, 0.0001, 64, new DRand());

    using cern.jet.random.engine.DRand, whose default constructor is documented as follows:

    Constructs and returns a random number generator with a default seed, which is a constant.

    With a constant seed for Colt's internal generator, the same input stream should produce the same quantile estimates on every run.
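
To illustrate why an unseeded internal generator breaks repeatability even when the input stream is seeded, here is a minimal sketch using only the JDK. It is not Colt's algorithm: it swaps in plain reservoir sampling as a stand-in for any randomized quantile estimator, and the class name SeededQuantileSketch and the method estimateQuantiles are hypothetical names chosen for this example. The point is only that the estimator's own generator has to be seeded as well.

```java
import java.util.Arrays;
import java.util.Random;

public class SeededQuantileSketch {

    /**
     * Estimates quantiles of a stream of uniform doubles from a fixed-size
     * reservoir sample (Algorithm R). Assumes streamLength >= reservoirSize.
     * streamRng produces the data; samplerRng drives the estimator's own
     * random choices, analogous to the RandomEngine passed to QuantileBin1D.
     */
    static double[] estimateQuantiles(Random streamRng, Random samplerRng,
                                      int streamLength, int reservoirSize, double[] phis) {
        double[] reservoir = new double[reservoirSize];
        for (int i = 0; i < streamLength; i++) {
            double value = streamRng.nextDouble();
            if (i < reservoirSize) {
                // Fill the reservoir with the first reservoirSize values.
                reservoir[i] = value;
            } else {
                // Replace a random slot with probability reservoirSize / (i + 1).
                int j = samplerRng.nextInt(i + 1);
                if (j < reservoirSize) {
                    reservoir[j] = value;
                }
            }
        }
        Arrays.sort(reservoir);
        double[] result = new double[phis.length];
        for (int k = 0; k < phis.length; k++) {
            // Pick the sample element at the requested rank, clamped to the last index.
            int index = Math.min(reservoirSize - 1, (int) (phis[k] * reservoirSize));
            result[k] = reservoir[index];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] phis = {0.1, 0.2, 0.3};
        // Same data seed and same sampler seed: the two estimates are identical.
        double[] a = estimateQuantiles(new Random(0), new Random(42), 1_500_000, 1000, phis);
        double[] b = estimateQuantiles(new Random(0), new Random(42), 1_500_000, 1000, phis);
        // Same data, but the sampler uses an unseeded Random, so this estimate
        // will very likely differ from run to run.
        double[] c = estimateQuantiles(new Random(0), new Random(), 1_500_000, 1000, phis);
        System.out.println(Arrays.toString(a));
        System.out.println("a equals b: " + Arrays.equals(a, b));
        System.out.println(Arrays.toString(c));
    }
}
```

The third call mirrors the situation in the question: the data is reproducible, but the estimator draws from a generator whose seed changes between runs. Passing an explicitly seeded generator, as with new DRand() in the Colt constructor, removes that source of variation.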