I am trying to develop an ML algorithm, for which I need to build the first version of the input data. The data is stored in an Oracle database and consists of clients that have contracts with the company. The contracts have quantitative variables associated with them, such as duration of the contract, number of products associated with the contract, price of the products, and so on. The clients may have (and almost always do have) multiple different contracts with the company, and I need one row per client, so I have thought of grouping by client and computing the MIN, MAX, AVG, VARIANCE and MEDIAN of those quantitative variables.
However, there are lots of variables associated with each contract for which I would like to compute those descriptives (almost 100), so I would like to find both an elegant and efficient way of writing my query.
To achieve that I have thought of two solutions: a simple one using a GROUP BY clause and writing one line per descriptive variable I want to compute (which, in my real scenario, means almost 500 lines in the SELECT clause), or a combination of pivots and unpivots that achieves the same output with far fewer lines of code.
A simplified version of my first approach would look something like the following query:
WITH aux (client_group, age, height, weight) AS (
    SELECT 1, 12, 64.94189341, 16.95628922 FROM dual UNION ALL
    SELECT 1, 43, 54.01820311, 78.21996174 FROM dual UNION ALL
    SELECT 1, 54, 51.65431586, 196.9548512 FROM dual UNION ALL
    SELECT 1, 12, 72.42150484, 47.87076223 FROM dual UNION ALL
    SELECT 1, 64, 90.16203412, 149.3447339 FROM dual UNION ALL
    SELECT 1, 87, 44.23760904, 5.825719181 FROM dual UNION ALL
    SELECT 2, 10, 90.59152129, 24.42941936 FROM dual UNION ALL
    SELECT 2, 20, 93.0075109,  166.2218454 FROM dual UNION ALL
    SELECT 2, 30, 55.00015751, 188.3958287 FROM dual UNION ALL
    SELECT 2, 45, 58.31594818, 162.7184656 FROM dual UNION ALL
    SELECT 2, 90, 97.83613127, 38.68734517 FROM dual UNION ALL
    SELECT 2, 10, 12.30569768, 186.3157839 FROM dual UNION ALL
    SELECT 3, 93, 15.29331011, 117.8718876 FROM dual UNION ALL
    SELECT 3, 84, 86.91246955, 194.1351409 FROM dual UNION ALL
    SELECT 3, 95, 74.95595866, 60.42511642 FROM dual UNION ALL
    SELECT 3, 48, 74.26342598, 196.673485  FROM dual UNION ALL
    SELECT 3, 10, 94.60058165, 130.5963113 FROM dual UNION ALL
    SELECT 3,  9, 74.81643457, 77.17657753 FROM dual
)
SELECT
    client_group,
    MIN(age),    MAX(age),    AVG(age),    VARIANCE(age),    MEDIAN(age),
    MIN(height), MAX(height), AVG(height), VARIANCE(height), MEDIAN(height),
    MIN(weight), MAX(weight), AVG(weight), VARIANCE(weight), MEDIAN(weight)
FROM aux
GROUP BY client_group;
The table used is a sample one I created myself; the real one has many more clients (almost 13 million) and many more variables (almost 100). As you can see, with this approach I would need to write a lot of redundant lines in the SELECT clause to cover all the variables, which in my opinion is rather ugly.
My other approach however would look something like the following:
WITH aux (client_group, age, height, weight) AS (
    SELECT 1, 12, 64.94189341, 16.95628922 FROM dual UNION ALL
    SELECT 1, 43, 54.01820311, 78.21996174 FROM dual UNION ALL
    SELECT 1, 54, 51.65431586, 196.9548512 FROM dual UNION ALL
    SELECT 1, 12, 72.42150484, 47.87076223 FROM dual UNION ALL
    SELECT 1, 64, 90.16203412, 149.3447339 FROM dual UNION ALL
    SELECT 1, 87, 44.23760904, 5.825719181 FROM dual UNION ALL
    SELECT 2, 10, 90.59152129, 24.42941936 FROM dual UNION ALL
    SELECT 2, 20, 93.0075109,  166.2218454 FROM dual UNION ALL
    SELECT 2, 30, 55.00015751, 188.3958287 FROM dual UNION ALL
    SELECT 2, 45, 58.31594818, 162.7184656 FROM dual UNION ALL
    SELECT 2, 90, 97.83613127, 38.68734517 FROM dual UNION ALL
    SELECT 2, 10, 12.30569768, 186.3157839 FROM dual UNION ALL
    SELECT 3, 93, 15.29331011, 117.8718876 FROM dual UNION ALL
    SELECT 3, 84, 86.91246955, 194.1351409 FROM dual UNION ALL
    SELECT 3, 95, 74.95595866, 60.42511642 FROM dual UNION ALL
    SELECT 3, 48, 74.26342598, 196.673485  FROM dual UNION ALL
    SELECT 3, 10, 94.60058165, 130.5963113 FROM dual UNION ALL
    SELECT 3,  9, 74.81643457, 77.17657753 FROM dual
),
aux2 AS (
    SELECT client_group, col_name, value_col
    FROM aux
    UNPIVOT ( value_col FOR col_name IN ( age, height, weight ) )
),
aux3 AS (
    SELECT
        client_group,
        col_name,
        AVG(value_col)      AS mean,
        VARIANCE(value_col) AS variancee,
        MEDIAN(value_col)   AS mediann,
        MAX(value_col)      AS maximum,
        MIN(value_col)      AS minimum
    FROM aux2
    GROUP BY client_group, col_name
)
SELECT *
FROM aux3
PIVOT (
    MAX(mean)      AS mean,
    MAX(variancee) AS variancee,
    MAX(mediann)   AS mediann,
    MAX(maximum)   AS maximum,
    MIN(minimum)   AS minimum
    FOR col_name IN ( 'AGE' AS "AGE", 'HEIGHT' AS "HEIGHT", 'WEIGHT' AS "WEIGHT" )
);
This approach would require far fewer lines of query in the real scenario, as I only need to list each desired variable once in the aux2 UNPIVOT clause. However, this query unpivots a huge wide-format table to long format, then groups it, and then pivots it back from long to wide, and I have doubts about the efficiency of that.
So, which approach should I use? I have tried to look at some metrics in the V$SQL dynamic performance view, but the queries with GROUP BY clauses do not appear there, so I do not know where to look.
Thank you very much in advance; if you need further details about anything, ask for them and I will provide.
If you want to aggregate multiple rows into one then use GROUP BY and aggregation functions:
SELECT client_group,
MIN(age),
MAX(age),
AVG(age),
VARIANCE(age),
MEDIAN(age),
MIN(weight),
MAX(weight),
AVG(weight),
VARIANCE(weight),
MEDIAN(weight),
MIN(height),
MAX(height),
AVG(height),
VARIANCE(height),
MEDIAN(height)
FROM aux
GROUP BY
client_group;
CLIENT_GROUP | MIN(AGE) | MAX(AGE) | AVG(AGE) | VARIANCE(AGE) | MEDIAN(AGE) | MIN(WEIGHT) | MAX(WEIGHT) | AVG(WEIGHT) | VARIANCE(WEIGHT) | MEDIAN(WEIGHT) | MIN(HEIGHT) | MAX(HEIGHT) | AVG(HEIGHT) | VARIANCE(HEIGHT) | MEDIAN(HEIGHT) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 12 | 87 | 45.33333333333333333333333333333333333333 | 877.46666666666666666666666666666666666 | 48.5 | 5.825719181 | 196.9548512 | 82.5287195785 | 5792.1105324904466527375 | 63.045361985 | 44.23760904 | 90.16203412 | 62.90592673 | 278.3366481669938008 | 59.48004826 |
2 | 10 | 90 | 34.16666666666666666666666666666666666667 | 924.166666666666666666666666666666666667 | 25 | 24.42941936 | 188.3958287 | 127.794781355 | 5683.60617887341174847 | 164.4701555 | 12.30569768 | 97.83613127 | 67.842827805 | 1078.08620430126785475 | 74.453734735 |
3 | 9 | 95 | 56.5 | 1612.3 | 66 | 60.42511642 | 196.673485 | 129.479753125 | 3267.09420872211803671 | 124.23409945 | 15.29331011 | 94.60058165 | 70.14036342 | 789.97206409484731992 | 74.886196615 |
There are no redundant statements in that query: each aggregation function calculates a different metric for a different input column, and that is about as succinct as you can get for a single row per client_group.
Another alternative would be to have one row per variable per client_group:
SELECT client_group,
type,
MIN(value),
MAX(value),
AVG(value),
VARIANCE(value),
MEDIAN(value)
FROM aux
UNPIVOT(value FOR type IN (age, weight, height))
GROUP BY
client_group,
type;
Which outputs:
CLIENT_GROUP | TYPE | MIN(VALUE) | MAX(VALUE) | AVG(VALUE) | VARIANCE(VALUE) | MEDIAN(VALUE) |
---|---|---|---|---|---|---|
1 | AGE | 12 | 87 | 45.33333333333333333333333333333333333333 | 877.46666666666666666666666666666666666 | 48.5 |
1 | HEIGHT | 44.23760904 | 90.16203412 | 62.90592673 | 278.3366481669938008 | 59.48004826 |
1 | WEIGHT | 5.825719181 | 196.9548512 | 82.5287195785 | 5792.1105324904466527375 | 63.045361985 |
2 | AGE | 10 | 90 | 34.16666666666666666666666666666666666667 | 924.166666666666666666666666666666666667 | 25 |
2 | HEIGHT | 12.30569768 | 97.83613127 | 67.842827805 | 1078.08620430126785475 | 74.453734735 |
2 | WEIGHT | 24.42941936 | 188.3958287 | 127.794781355 | 5683.60617887341174847 | 164.4701555 |
3 | AGE | 9 | 95 | 56.5 | 1612.3 | 66 |
3 | HEIGHT | 15.29331011 | 94.60058165 | 70.14036342 | 789.97206409484731992 | 74.886196615 |
3 | WEIGHT | 60.42511642 | 196.673485 | 129.479753125 | 3267.09420872211803671 | 124.23409945 |
You get exactly the same data just split into multiple rows for each metric.
There is no point using the second method and then re-pivoting back to having multiple columns on a single row; that is just doing lots of extra work for no benefit when compared to the first method.
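(As an aside, since you mentioned not knowing where to look for metrics: the plans below were produced with EXPLAIN PLAN and DBMS_XPLAN. A minimal sketch, using a trimmed version of your first query as a placeholder:)

```sql
-- Ask the optimizer to store the plan for a statement (not execute it) ...
EXPLAIN PLAN FOR
SELECT client_group, MIN(age), MAX(age)
FROM   aux
GROUP BY client_group;

-- ... then read it back, formatted, from the plan table.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```
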
The explain plan for the first query is:
Plan hash value: 1572259232
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | 63 | 4 (25)| 00:00:01 |
| 1 | SORT GROUP BY | | 3 | 63 | 4 (25)| 00:00:01 |
| 2 | TABLE ACCESS FULL| AUX | 18 | 378 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------
and compare that to the explain plan for the query:
EXPLAIN PLAN FOR
SELECT *
FROM (
SELECT client_group,
type,
MIN(value) AS min_value,
MAX(value) AS max_value,
AVG(value) AS avg_value,
VARIANCE(value) AS var_value,
MEDIAN(value) AS med_value
FROM aux
UNPIVOT INCLUDE NULLS (value FOR type IN (age, weight, height))
GROUP BY
client_group,
type
)
PIVOT (
MAX(min_value) AS min,
MAX(max_value) AS max,
MAX(avg_value) AS avg,
MAX(var_value) AS var,
MAX(med_value) AS med
FOR type IN (
'AGE' AS age,
'WEIGHT' AS weight,
'HEIGHT' AS height
)
);
is:
Plan hash value: 4041832950
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 54 | 4482 | 6 (17)| 00:00:01 |
| 1 | SORT GROUP BY NOSORT PIVOT| | 54 | 4482 | 6 (17)| 00:00:01 |
| 2 | VIEW | | 54 | 4482 | 6 (17)| 00:00:01 |
| 3 | SORT GROUP BY | | 54 | 1674 | 6 (17)| 00:00:01 |
| 4 | VIEW | | 54 | 1674 | 5 (0)| 00:00:01 |
| 5 | UNPIVOT | | | | | |
| 6 | TABLE ACCESS FULL | AUX | 18 | 378 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
then you can see how much more work is being done just so you can type a few fewer characters. Just use the first method with GROUP BY, and either copy-paste the aggregation functions and edit the column name you are aggregating, or write a small SQL script to generate the aggregation query from the data dictionary:
SELECT 'SELECT client_group'
|| LISTAGG(
', MAX(' || column_name || ')'
|| ', MIN(' || column_name || ')'
|| ', AVG(' || column_name || ')'
|| ', VARIANCE(' || column_name || ')'
|| ', MEDIAN(' || column_name || ')'
       ) WITHIN GROUP (ORDER BY column_id)
|| ' FROM aux GROUP BY client_group' AS sql
FROM user_tab_columns
WHERE table_name = 'AUX'
AND column_name <> 'CLIENT_GROUP';
Outputs:
SQL |
---|
SELECT client_group, MAX(AGE), MIN(AGE), AVG(AGE), VARIANCE(AGE), MEDIAN(AGE), MAX(HEIGHT), MIN(HEIGHT), AVG(HEIGHT), VARIANCE(HEIGHT), MEDIAN(HEIGHT), MAX(WEIGHT), MIN(WEIGHT), AVG(WEIGHT), VARIANCE(WEIGHT), MEDIAN(WEIGHT) FROM aux GROUP BY client_group |
(However, you may run into issues if your generated SQL statement is going to be longer than 4000 characters and you need to aggregate into a CLOB - but there are ways around that.)
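One such workaround, sketched here under the assumption that the same AUX table and column names are in play: build the statement in a PL/SQL CLOB variable instead of using LISTAGG, so the VARCHAR2 length limit never applies.

```sql
-- Sketch: assemble the generated statement into a CLOB in PL/SQL,
-- sidestepping LISTAGG's VARCHAR2 length limit entirely.
DECLARE
  v_sql CLOB := 'SELECT client_group';
BEGIN
  FOR c IN (SELECT column_name
            FROM   user_tab_columns
            WHERE  table_name = 'AUX'
            AND    column_name <> 'CLIENT_GROUP'
            ORDER BY column_id)
  LOOP
    v_sql := v_sql
          || ', MIN('      || c.column_name || ')'
          || ', MAX('      || c.column_name || ')'
          || ', AVG('      || c.column_name || ')'
          || ', VARIANCE(' || c.column_name || ')'
          || ', MEDIAN('   || c.column_name || ')';
  END LOOP;
  v_sql := v_sql || ' FROM aux GROUP BY client_group';
  -- Inspect the generated statement; alternatively OPEN a ref cursor
  -- or EXECUTE IMMEDIATE it.
  DBMS_OUTPUT.PUT_LINE(v_sql);
END;
/
```
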