I have a dataset which essentially consists of a list of job batches, the number of jobs contained in each batch, and the duration of each job batch. Here is a sample dataset:
CREATE TABLE test_data
(
batch_id NUMBER,
job_count NUMBER,
duration NUMBER
);
INSERT INTO test_data VALUES (1, 37, 9);
INSERT INTO test_data VALUES (2, 47, 4);
INSERT INTO test_data VALUES (3, 66, 6);
INSERT INTO test_data VALUES (4, 46, 6);
INSERT INTO test_data VALUES (5, 54, 1);
INSERT INTO test_data VALUES (6, 35, 1);
INSERT INTO test_data VALUES (7, 55, 9);
INSERT INTO test_data VALUES (8, 82, 7);
INSERT INTO test_data VALUES (9, 12, 9);
INSERT INTO test_data VALUES (10, 52, 4);
INSERT INTO test_data VALUES (11, 3, 9);
INSERT INTO test_data VALUES (12, 90, 2);
Now, I want to calculate some percentiles for the duration field. Typically, this is done with something like the following:
SELECT
PERCENTILE_DISC( 0.75 )
WITHIN GROUP (ORDER BY duration ASC)
AS third_quartile
FROM
test_data;
(Which gives the result of 9)
My problem here is that we don't want to get the percentiles based on batches, I want to get them based on individual jobs. I can figure this out by hand quite easily by generating a running total of the job_count:
SELECT
batch_id,
job_count,
SUM(
job_count
)
OVER (
ORDER BY duration
ROWS UNBOUNDED PRECEDING
)
AS total_jobs,
duration
FROM
test_data
ORDER BY
duration ASC;
BATCH_ID JOB_COUNT TOTAL_JOBS DURATION
6 35 35 1
5 54 89 1
12 90 179 2
2 47 226 4
10 52 278 4
3 66 344 6
4 46 390 6
8 82 472 7
9 12 484 9
1 37 521 9
11 3 524 9
7 55 579 9
Since I have 579 jobs, then the 75th percentile would be job 434. Looking at the above result set, that corresponds with a duration of 7, different from what the standard function does.
Essentially, I want to consider each job in a batch as a separate observation, and determine percentiles based on those, instead on the batches.
Is there a relatively simple way to accomplish this?
I would think of this as "weighted" percentiles. I don't know if there is a built-in analytic function for this in Oracle, but it is easy enough to calculate. And you are on the way there.
The additional idea is to calculate the total number of jobs, and then use arithmetic to select the value you want. For the 75th percentile, the value is the smallest duration such that the cumulative number of jobs is greater than 0.75 times the total number of jobs.
Here is the example in SQL:
select pcs.percentile, min(case when cumjobs >= totjobs * percentile then duration end)
from (SELECT batch_id, job_count,
SUM(job_count) OVER (ORDER BY duration) as cumjobs,
sum(job_count) over () as totjobs,
duration
FROM test_data
) t cross join
(select 0.25 as percentile from dual union all
select 0.5 from dual union all
select 0.75 from dual
) pcs
group by pcs.percentile;
This example gives you the percentile values (and as an added bonus, for three different percentiles) with each value on its own row. If you want the values on each row, you need to join back to your original table.