I am trying to develop an ML algorithm, for which I need to build the first version of the input data. The data is stored in an Oracle database and consists of clients that have contracts with the company. The contracts have quantitative variables associated with them, such as duration of the contract, number of products associated with the contract, price of the products, and so on. The clients may have (and almost always do have) multiple different contracts with the company, and I need one row per client, so I have thought of grouping by client and computing the MIN, MAX, AVG, VARIANCE and MEDIAN of those quantitative variables.
However, there are lots of variables associated with each contract for which I would like to compute those descriptives (almost 100), so I would like to find both an elegant and efficient way of writing my query.
To achieve that I have thought of two solutions: a simple one using a GROUP BY clause and writing one line per descriptive variable I want to compute (which, in my real scenario, means almost 500 lines in the SELECT clause), or a combination of pivots and unpivots that achieves the same output with far fewer lines of code.
A simplified version of my first approach would look something like the following query:
WITH aux (client_group, age, height, weight) AS (
    SELECT 1, 12, 64.94189341, 16.95628922 FROM dual UNION ALL
    SELECT 1, 43, 54.01820311, 78.21996174 FROM dual UNION ALL
    SELECT 1, 54, 51.65431586, 196.9548512 FROM dual UNION ALL
    SELECT 1, 12, 72.42150484, 47.87076223 FROM dual UNION ALL
    SELECT 1, 64, 90.16203412, 149.3447339 FROM dual UNION ALL
    SELECT 1, 87, 44.23760904, 5.825719181 FROM dual UNION ALL
    SELECT 2, 10, 90.59152129, 24.42941936 FROM dual UNION ALL
    SELECT 2, 20, 93.0075109,  166.2218454 FROM dual UNION ALL
    SELECT 2, 30, 55.00015751, 188.3958287 FROM dual UNION ALL
    SELECT 2, 45, 58.31594818, 162.7184656 FROM dual UNION ALL
    SELECT 2, 90, 97.83613127, 38.68734517 FROM dual UNION ALL
    SELECT 2, 10, 12.30569768, 186.3157839 FROM dual UNION ALL
    SELECT 3, 93, 15.29331011, 117.8718876 FROM dual UNION ALL
    SELECT 3, 84, 86.91246955, 194.1351409 FROM dual UNION ALL
    SELECT 3, 95, 74.95595866, 60.42511642 FROM dual UNION ALL
    SELECT 3, 48, 74.26342598, 196.673485  FROM dual UNION ALL
    SELECT 3, 10, 94.60058165, 130.5963113 FROM dual UNION ALL
    SELECT 3,  9, 74.81643457, 77.17657753 FROM dual
)
SELECT
    client_group,
    MIN(age),    MAX(age),    AVG(age),    VARIANCE(age),    MEDIAN(age),
    MIN(height), MAX(height), AVG(height), VARIANCE(height), MEDIAN(height),
    MIN(weight), MAX(weight), AVG(weight), VARIANCE(weight), MEDIAN(weight)
FROM aux
GROUP BY client_group;
The table used is a sample one I created myself; the real one has many more clients (almost 13 million) and many more variables (almost 100). As you can see, with this approach I would need to write a lot of redundant lines in the SELECT clause to cover all the variables, which in my opinion is rather ugly.
My other approach however would look something like the following:
WITH aux (client_group, age, height, weight) AS (
    SELECT 1, 12, 64.94189341, 16.95628922 FROM dual UNION ALL
    SELECT 1, 43, 54.01820311, 78.21996174 FROM dual UNION ALL
    SELECT 1, 54, 51.65431586, 196.9548512 FROM dual UNION ALL
    SELECT 1, 12, 72.42150484, 47.87076223 FROM dual UNION ALL
    SELECT 1, 64, 90.16203412, 149.3447339 FROM dual UNION ALL
    SELECT 1, 87, 44.23760904, 5.825719181 FROM dual UNION ALL
    SELECT 2, 10, 90.59152129, 24.42941936 FROM dual UNION ALL
    SELECT 2, 20, 93.0075109,  166.2218454 FROM dual UNION ALL
    SELECT 2, 30, 55.00015751, 188.3958287 FROM dual UNION ALL
    SELECT 2, 45, 58.31594818, 162.7184656 FROM dual UNION ALL
    SELECT 2, 90, 97.83613127, 38.68734517 FROM dual UNION ALL
    SELECT 2, 10, 12.30569768, 186.3157839 FROM dual UNION ALL
    SELECT 3, 93, 15.29331011, 117.8718876 FROM dual UNION ALL
    SELECT 3, 84, 86.91246955, 194.1351409 FROM dual UNION ALL
    SELECT 3, 95, 74.95595866, 60.42511642 FROM dual UNION ALL
    SELECT 3, 48, 74.26342598, 196.673485  FROM dual UNION ALL
    SELECT 3, 10, 94.60058165, 130.5963113 FROM dual UNION ALL
    SELECT 3,  9, 74.81643457, 77.17657753 FROM dual
),
aux2 AS (
    SELECT client_group, col_name, value_col
    FROM aux
    UNPIVOT ( value_col FOR col_name IN ( age, height, weight ) )
),
aux3 AS (
    SELECT
        client_group,
        col_name,
        AVG(value_col)      AS mean,
        VARIANCE(value_col) AS variancee,
        MEDIAN(value_col)   AS mediann,
        MAX(value_col)      AS maximum,
        MIN(value_col)      AS minimum
    FROM aux2
    GROUP BY client_group, col_name
)
SELECT *
FROM aux3
PIVOT (
    MAX(mean)      AS mean,
    MAX(variancee) AS variancee,
    MAX(mediann)   AS mediann,
    MAX(maximum)   AS maximum,
    MIN(minimum)   AS minimum
    FOR col_name IN ( 'AGE' AS "AGE", 'HEIGHT' AS "HEIGHT", 'WEIGHT' AS "WEIGHT" )
);
This approach would require far fewer lines of query in the real scenario, as I only need to list each desired variable once in the aux2 UNPIVOT clause. However, this query unpivots a huge wide-format table to long format, then groups it, and then pivots it back from long to wide, and I have doubts about the efficiency of that.
So, which approach should I use? I have tried to look at some metrics in the V$SQL dynamic performance view, but the queries with GROUP BY clauses do not appear there, so I do not know where to look.
Thank you very much in advance; if you need further details about anything, ask for them and I will provide.
If you want to aggregate multiple rows into one then use GROUP BY and aggregation functions:
SELECT client_group,
MIN(age),
MAX(age),
AVG(age),
VARIANCE(age),
MEDIAN(age),
MIN(weight),
MAX(weight),
AVG(weight),
VARIANCE(weight),
MEDIAN(weight),
MIN(height),
MAX(height),
AVG(height),
VARIANCE(height),
MEDIAN(height)
FROM aux
GROUP BY
client_group;
CLIENT_GROUP | MIN(AGE) | MAX(AGE) | AVG(AGE) | VARIANCE(AGE) | MEDIAN(AGE) | MIN(WEIGHT) | MAX(WEIGHT) | AVG(WEIGHT) | VARIANCE(WEIGHT) | MEDIAN(WEIGHT) | MIN(HEIGHT) | MAX(HEIGHT) | AVG(HEIGHT) | VARIANCE(HEIGHT) | MEDIAN(HEIGHT) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 12 | 87 | 45.33333333333333333333333333333333333333 | 877.46666666666666666666666666666666666 | 48.5 | 5.825719181 | 196.9548512 | 82.5287195785 | 5792.1105324904466527375 | 63.045361985 | 44.23760904 | 90.16203412 | 62.90592673 | 278.3366481669938008 | 59.48004826 |
2 | 10 | 90 | 34.16666666666666666666666666666666666667 | 924.166666666666666666666666666666666667 | 25 | 24.42941936 | 188.3958287 | 127.794781355 | 5683.60617887341174847 | 164.4701555 | 12.30569768 | 97.83613127 | 67.842827805 | 1078.08620430126785475 | 74.453734735 |
3 | 9 | 95 | 56.5 | 1612.3 | 66 | 60.42511642 | 196.673485 | 129.479753125 | 3267.09420872211803671 | 124.23409945 | 15.29331011 | 94.60058165 | 70.14036342 | 789.97206409484731992 | 74.886196615 |
There are no redundant statements in that query: each aggregation function calculates a different metric for a different input column, and that is about as succinct as you can get for a single row per client_group.
Another alternative would be to have one row per variable per client_group:
SELECT client_group,
type,
MIN(value),
MAX(value),
AVG(value),
VARIANCE(value),
MEDIAN(value)
FROM aux
UNPIVOT(value FOR type IN (age, weight, height))
GROUP BY
client_group,
type;
Which outputs:
CLIENT_GROUP | TYPE | MIN(VALUE) | MAX(VALUE) | AVG(VALUE) | VARIANCE(VALUE) | MEDIAN(VALUE) |
---|---|---|---|---|---|---|
1 | AGE | 12 | 87 | 45.33333333333333333333333333333333333333 | 877.46666666666666666666666666666666666 | 48.5 |
1 | HEIGHT | 44.23760904 | 90.16203412 | 62.90592673 | 278.3366481669938008 | 59.48004826 |
1 | WEIGHT | 5.825719181 | 196.9548512 | 82.5287195785 | 5792.1105324904466527375 | 63.045361985 |
2 | AGE | 10 | 90 | 34.16666666666666666666666666666666666667 | 924.166666666666666666666666666666666667 | 25 |
2 | HEIGHT | 12.30569768 | 97.83613127 | 67.842827805 | 1078.08620430126785475 | 74.453734735 |
2 | WEIGHT | 24.42941936 | 188.3958287 | 127.794781355 | 5683.60617887341174847 | 164.4701555 |
3 | AGE | 9 | 95 | 56.5 | 1612.3 | 66 |
3 | HEIGHT | 15.29331011 | 94.60058165 | 70.14036342 | 789.97206409484731992 | 74.886196615 |
3 | WEIGHT | 60.42511642 | 196.673485 | 129.479753125 | 3267.09420872211803671 | 124.23409945 |
You get exactly the same data just split into multiple rows for each metric.
There is no point using the second method and then re-pivoting back to having multiple columns on a single row; that is just doing lots of extra work for no benefit when compared to the first method.
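(As an aside, since you mentioned not knowing where to look for metrics: the plans below were produced with EXPLAIN PLAN and DBMS_XPLAN. A minimal sketch, using a trimmed version of your first query as a placeholder:)

```sql
-- Ask the optimizer to store the plan for a statement (not execute it) ...
EXPLAIN PLAN FOR
SELECT client_group, MIN(age), MAX(age)
FROM   aux
GROUP BY client_group;

-- ... then read it back, formatted, from the plan table.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```
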
The explain plan for the first query is:
Plan hash value: 1572259232
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | 63 | 4 (25)| 00:00:01 |
| 1 | SORT GROUP BY | | 3 | 63 | 4 (25)| 00:00:01 |
| 2 | TABLE ACCESS FULL| AUX | 18 | 378 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------
and compare that to the explain plan for the query:
EXPLAIN PLAN FOR
SELECT *
FROM (
SELECT client_group,
type,
MIN(value) AS min_value,
MAX(value) AS max_value,
AVG(value) AS avg_value,
VARIANCE(value) AS var_value,
MEDIAN(value) AS med_value
FROM aux
UNPIVOT INCLUDE NULLS (value FOR type IN (age, weight, height))
GROUP BY
client_group,
type
)
PIVOT (
MAX(min_value) AS min,
MAX(max_value) AS max,
MAX(avg_value) AS avg,
MAX(var_value) AS var,
MAX(med_value) AS med
FOR type IN (
'AGE' AS age,
'WEIGHT' AS weight,
'HEIGHT' AS height
)
);
is:
Plan hash value: 4041832950
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 54 | 4482 | 6 (17)| 00:00:01 |
| 1 | SORT GROUP BY NOSORT PIVOT| | 54 | 4482 | 6 (17)| 00:00:01 |
| 2 | VIEW | | 54 | 4482 | 6 (17)| 00:00:01 |
| 3 | SORT GROUP BY | | 54 | 1674 | 6 (17)| 00:00:01 |
| 4 | VIEW | | 54 | 1674 | 5 (0)| 00:00:01 |
| 5 | UNPIVOT | | | | | |
| 6 | TABLE ACCESS FULL | AUX | 18 | 378 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
then you can see how much more work is being done just so you can type a few fewer characters. Just use the first method with GROUP BY, and either copy-paste the aggregation functions and edit the column name you are aggregating, or write a small SQL script to generate the aggregation query from the data dictionary:
SELECT 'SELECT client_group'
|| LISTAGG(
', MAX(' || column_name || ')'
|| ', MIN(' || column_name || ')'
|| ', AVG(' || column_name || ')'
|| ', VARIANCE(' || column_name || ')'
|| ', MEDIAN(' || column_name || ')'
       ) WITHIN GROUP (ORDER BY column_id)
|| ' FROM aux GROUP BY client_group' AS sql
FROM user_tab_columns
WHERE table_name = 'AUX'
AND column_name <> 'CLIENT_GROUP';
Outputs:
SQL |
---|
SELECT client_group, MAX(AGE), MIN(AGE), AVG(AGE), VARIANCE(AGE), MEDIAN(AGE), MAX(HEIGHT), MIN(HEIGHT), AVG(HEIGHT), VARIANCE(HEIGHT), MEDIAN(HEIGHT), MAX(WEIGHT), MIN(WEIGHT), AVG(WEIGHT), VARIANCE(WEIGHT), MEDIAN(WEIGHT) FROM aux GROUP BY client_group |
(However, you may run into issues if your generated SQL statement is going to be longer than 4000 characters and you need to aggregate into a CLOB - but there are ways around that.)
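One such workaround, sketched here under the assumption that the same AUX table and column names are in play: build the statement in a PL/SQL CLOB variable instead of using LISTAGG, so the VARCHAR2 length limit never applies.

```sql
-- Sketch: assemble the generated statement into a CLOB in PL/SQL,
-- sidestepping LISTAGG's VARCHAR2 length limit entirely.
DECLARE
  v_sql CLOB := 'SELECT client_group';
BEGIN
  FOR c IN (SELECT column_name
            FROM   user_tab_columns
            WHERE  table_name = 'AUX'
            AND    column_name <> 'CLIENT_GROUP'
            ORDER BY column_id)
  LOOP
    v_sql := v_sql
          || ', MIN('      || c.column_name || ')'
          || ', MAX('      || c.column_name || ')'
          || ', AVG('      || c.column_name || ')'
          || ', VARIANCE(' || c.column_name || ')'
          || ', MEDIAN('   || c.column_name || ')';
  END LOOP;
  v_sql := v_sql || ' FROM aux GROUP BY client_group';
  -- Inspect the generated statement; alternatively OPEN a ref cursor
  -- or EXECUTE IMMEDIATE it.
  DBMS_OUTPUT.PUT_LINE(v_sql);
END;
/
```
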