Another approach to percentiles?

I have a dataset which essentially consists of a list of job batches, the number of jobs contained in each batch, and the duration of each job batch. Here is a sample dataset:

CREATE TABLE test_data (
   batch_id    NUMBER,
   job_count   NUMBER,
   duration    NUMBER
);

INSERT INTO test_data VALUES (1, 37, 9);
INSERT INTO test_data VALUES (2, 47, 4);
INSERT INTO test_data VALUES (3, 66, 6);
INSERT INTO test_data VALUES (4, 46, 6);
INSERT INTO test_data VALUES (5, 54, 1);
INSERT INTO test_data VALUES (6, 35, 1);
INSERT INTO test_data VALUES (7, 55, 9);
INSERT INTO test_data VALUES (8, 82, 7);
INSERT INTO test_data VALUES (9, 12, 9);
INSERT INTO test_data VALUES (10, 52, 4);
INSERT INTO test_data VALUES (11, 3, 9);
INSERT INTO test_data VALUES (12, 90, 2);

Now, I want to calculate some percentiles for the duration field. Typically, this is done with something like the following:

SELECT PERCENTILE_DISC( 0.75 )
          WITHIN GROUP (ORDER BY duration ASC)
          AS third_quartile
FROM test_data;

(This gives a result of 9.)
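To make the per-batch result concrete, here is a minimal Python sketch of the discrete-percentile rule (the smallest value whose cumulative distribution is at least the requested fraction). The function name `percentile_disc` is only an illustrative stand-in for the SQL function, not an Oracle API, and the list simply copies the `duration` column of the sample rows.

```python
import math

# Durations of the 12 sample batches, one observation per batch.
durations = [9, 4, 6, 6, 1, 1, 9, 7, 9, 4, 9, 2]

def percentile_disc(values, p):
    """Return the smallest value whose cumulative distribution is >= p."""
    ordered = sorted(values)
    # 1-based rank of the first row whose cumulative share reaches p.
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

print(percentile_disc(durations, 0.75))  # 9
```

With 12 batches the 75th percentile lands on the 9th sorted value, which is 9, matching the query result above.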

My problem here is that I don't want the percentiles based on batches; I want them based on individual jobs. I can work this out by hand quite easily by generating a running total of job_count:

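SELECT batch_id, job_count,
       SUM(job_count) OVER (
              ORDER BY duration
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
          AS total_jobs,
       duration
FROM test_data
ORDER BY duration ASC;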

BATCH_ID     JOB_COUNT    TOTAL_JOBS   DURATION
6            35           35           1
5            54           89           1
12           90           179          2
2            47           226          4
10           52           278          4
3            66           344          6
4            46           390          6
8            82           472          7
9            12           484          9
1            37           521          9
11           3            524          9
7            55           579          9

Since I have 579 jobs, the 75th percentile falls at about job 434 (0.75 * 579 = 434.25). Looking at the result set above, that corresponds to a duration of 7, which differs from what the standard function returns.

Essentially, I want to treat each job in a batch as a separate observation and determine percentiles based on those, instead of on the batches.
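As a sanity check on that expectation, the following Python sketch expands every batch into one observation per job and applies the same discrete-percentile rule. It is only an illustration of the intended result on the sample rows, not a suggestion for how to do this in the database; `percentile_disc` is a hypothetical helper name.

```python
import math

# (batch_id, job_count, duration) rows from the sample data.
batches = [
    (1, 37, 9), (2, 47, 4), (3, 66, 6), (4, 46, 6),
    (5, 54, 1), (6, 35, 1), (7, 55, 9), (8, 82, 7),
    (9, 12, 9), (10, 52, 4), (11, 3, 9), (12, 90, 2),
]

def percentile_disc(values, p):
    """Return the smallest value whose cumulative distribution is >= p."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

# Repeat each batch's duration once per job in that batch.
per_job = [
    duration
    for _batch_id, job_count, duration in batches
    for _job in range(job_count)
]

print(len(per_job))                    # 579
print(percentile_disc(per_job, 0.75))  # 7
```

Expanding the rows like this is fine for a small sample, but it grows with the total number of jobs, which is why a cumulative-sum approach is attractive for real data.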

Is there a relatively simple way to accomplish this?


  • I would think of this as "weighted" percentiles. I don't know if there is a built-in analytic function for this in Oracle, but it is easy enough to calculate. And you are on the way there.

    The additional idea is to calculate the total number of jobs, and then use arithmetic to select the value you want. For the 75th percentile, the value is the smallest duration such that the cumulative number of jobs is at least 0.75 times the total number of jobs.

    Here is the example in SQL:

    select pcs.percentile,
           min(case when cumjobs >= totjobs * pcs.percentile then duration end)
    from (SELECT batch_id, job_count, duration,
                 SUM(job_count) OVER (ORDER BY duration) as cumjobs,
                 sum(job_count) over () as totjobs
          FROM test_data
         ) t cross join
         (select 0.25 as percentile from dual union all
          select 0.5 from dual union all
          select 0.75 from dual
         ) pcs
    group by pcs.percentile;

    This example gives you the percentile values (and, as an added bonus, for three different percentiles) with each percentile on its own row. If you want the values on every row of the original data, you need to join the result back to your original table.
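The same arithmetic can be checked outside the database. The Python sketch below mirrors the cumulative-sum idea on the sample rows: it totals the jobs per duration, walks the durations in ascending order, and returns the first one whose running total reaches the requested share of all jobs. The name `weighted_percentile` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

# (job_count, duration) pairs from the sample data.
rows = [
    (37, 9), (47, 4), (66, 6), (46, 6), (54, 1), (35, 1),
    (55, 9), (82, 7), (12, 9), (52, 4), (3, 9), (90, 2),
]

def weighted_percentile(rows, p):
    """Smallest duration whose cumulative job count is >= p * total jobs."""
    jobs_per_duration = defaultdict(int)
    for job_count, duration in rows:
        jobs_per_duration[duration] += job_count
    total = sum(jobs_per_duration.values())
    cumulative = 0
    for duration in sorted(jobs_per_duration):
        cumulative += jobs_per_duration[duration]
        if cumulative >= p * total:
            return duration
    return None  # only reached when rows is empty

results = {p: weighted_percentile(rows, p) for p in (0.25, 0.5, 0.75)}
print(results)  # {0.25: 2, 0.5: 6, 0.75: 7}
```

For the sample data this gives 2, 6 and 7 for the 25th, 50th and 75th percentiles, so the 75th percentile agrees with the value worked out by hand in the question.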