Search code examples
sqloracleperformance

What is the most efficient way of calculating multiple descriptives per group over lots of variables in Oracle SQL?


I am trying to develop a ML algorithm for which I need to build the first version of the input data. The data is stored in an Oracle database, and it consists of some clients that have some contracts with the company. The contracts have some quantitative variables asociated to them such as duration of the contract, number of products asociated to the contract, price of the products... The clients may have (and almost always have) multiple different contracts with the company and I need one row per client, so I have though of grouping by client and computing the MIN, MAX, AVG, VARIANCE and MEDIAN of those quantitative variables.

However, there are lots of variables asociated to the contract from which I would like to compute those descriptives (almost a 100), so I would like to find both an elegant and efficient way of writing my query.

To achieve that I have thought of two solutions, a simple one using one GROUP BY clause and writing one line per descriptive variable I want to compute (thus, in my real scenario write almost 500 lines in the SELECT clause) or use a combination of pivots and unpivots to achieve the same output with much less lines of code.

A simplified version of my first approach would look something like the following query:

WITH aux (
    client_group,
    age,
    height,
    weight
) AS (
    SELECT
        1,
        12,
        64.94189341,
        16.95628922
    FROM
        dual
    UNION ALL
    SELECT
        1,
        43,
        54.01820311,
        78.21996174
    FROM
        dual
    UNION ALL
    SELECT
        1,
        54,
        51.65431586,
        196.9548512 FROM
        dual
    UNION ALL
    SELECT
        1,
        12,
        72.42150484,
        47.87076223
    FROM
        dual
    UNION ALL
    SELECT
        1,
        64,
        90.16203412,
        149.3447339
    FROM
        dual
    UNION ALL
    SELECT 1,
        87,
        44.23760904,
        5.825719181 FROM
        dual
    UNION ALL
    SELECT
        2,10,
        90.59152129,
        24.42941936
    FROM
        dual
    UNION ALL
    SELECT
        2,20,
        93.0075109,
        166.2218454
    FROM
        dual
    UNION ALL
    SELECT
        2,30,55.00015751,  188.3958287
    FROM
        dual
    UNION ALL
    SELECT
        2,
        45,58.31594818,  162.7184656
    FROM
        dual
    UNION ALL
    SELECT
        2,
        90,
        97.83613127,
        38.68734517
    FROM
        dual
    UNION ALL
    SELECT
        2,
        10,
        12.30569768,
        186.3157839
    FROM
        dual
    UNION ALL
    SELECT
        3,
        93,
        15.29331011,
        117.8718876
    FROM
        dual
    UNION ALL
    SELECT 3,84,
        86.91246955,
        194.1351409
    FROM
        dual
    UNION ALL
    SELECT
        3,
        95,
        74.95595866,
        60.42511642
    FROM
        dual
    UNION ALL
    SELECT
        3,
        48,
        74.26342598,
        196.673485
    FROM
        dual
    UNION ALL
    SELECT
        3,
        10,
        94.60058165,
        130.5963113
    FROM
        dual
    UNION ALL
    SELECT
        3,
        9,
        74.81643457,
        77.17657753
    FROM
        dual
)
SELECT
    client_group,
    MIN(age),
    MAX(age),
    AVG(age),
    VARIANCE(age),
    MEDIAN(age),
    MIN(age),
    MAX(age),
    AVG(age),
    VARIANCE(age),
    MEDIAN(age),
    MIN(age),
    MAX(age),
    AVG(age),
    VARIANCE(age),
    MEDIAN(age)
FROM
    aux
GROUP BY
    client_group;

The table used is a sample one I have created by myself, the real one has much more clients (almost 13 million) and much more variables (almost a 100). As you can see, with this approach I would need to write a lot of redundant lines in the SELECT clause to get all the variables and in my opinion is kinda ugly.

My other approach however would look something like the following:

WITH aux (
    client_group,
    age,
    height,
    weight
) AS (
    SELECT
        1,
        12,
        64.94189341,
        16.95628922
    FROM
        dual
    UNION ALL
    SELECT
        1,
        43,
        54.01820311,
        78.21996174
    FROM
        dual
    UNION ALL
    SELECT
        1,
        54,
        51.65431586,
        196.9548512 FROM
        dual
    UNION ALL
    SELECT
        1,
        12,
        72.42150484,
        47.87076223
    FROM
        dual
    UNION ALL
    SELECT
        1,
        64,
        90.16203412,
        149.3447339
    FROM
        dual
    UNION ALL
    SELECT 1,
        87,
        44.23760904,
        5.825719181 FROM
        dual
    UNION ALL
    SELECT
        2,10,
        90.59152129,
        24.42941936
    FROM
        dual
    UNION ALL
    SELECT
        2,20,
        93.0075109,
        166.2218454
    FROM
        dual
    UNION ALL
    SELECT
        2,30,55.00015751,  188.3958287
    FROM
        dual
    UNION ALL
    SELECT
        2,
        45,58.31594818,  162.7184656
    FROM
        dual
    UNION ALL
    SELECT
        2,
        90,
        97.83613127,
        38.68734517
    FROM
        dual
    UNION ALL
    SELECT
        2,
        10,
        12.30569768,
        186.3157839
    FROM
        dual
    UNION ALL
    SELECT
        3,
        93,
        15.29331011,
        117.8718876
    FROM
        dual
    UNION ALL
    SELECT 3,84,
        86.91246955,
        194.1351409
    FROM
        dual
    UNION ALL
    SELECT
        3,
        95,
        74.95595866,
        60.42511642
    FROM
        dual
    UNION ALL
    SELECT
        3,
        48,
        74.26342598,
        196.673485
    FROM
        dual
    UNION ALL
    SELECT
        3,
        10,
        94.60058165,
        130.5963113
    FROM
        dual
    UNION ALL
    SELECT
        3,
        9,
        74.81643457,
        77.17657753
    FROM
        dual
),
aux2 AS (
    SELECT
        client_group,
        col_name,
        value_col
    FROM
        aux UNPIVOT ( value_col
            FOR col_name
        IN ( age,
             height,
             weight ) )
), aux3 AS (
    SELECT DISTINCT
        client_group,
        col_name,
        AVG(value_col) AS mean,
        VARIANCE(value_col) AS variancee,
        MEDIAN(value_col) AS mediann,
        MAX(value_col) AS maximum,
        MIN(value_col) AS minimum
    FROM
        aux2
    GROUP BY
        client_group,
        col_name
)
SELECT
    *
FROM
    aux3 PIVOT (
        MAX ( mean )
    AS mean, MAX ( variancee ) AS variancee, MAX ( mediann ) AS mediann, MAX(maximum) AS maximum, MIN(minimum) AS minimum
        FOR col_name
        IN ( 'HEIGHT' AS "HEIGHT", 'AGE' AS "AGE", 'WEIGHT' AS "WEIGHT", 'MIN' AS "MIN", 'MAX' AS "MAX" )
    );

This approach however would imply much less lines of query in the real scenario, as I only need to add all the desired variables in the aux2 table once. However, this query unpivots a huge wide-format table to long-format, then groups it and then pivots again long-to-wide, and I doubt about the efficiency of that.

So, which approach should I use? I have tried to look at some metrics in the V$SQL special Oracle database table but the queries with GROUP BY clauses do not appear there, so I do not know where to look.

Thenk you very much in advance, if you need further details about anything ask for them and I will provide.


Solution

  • If you want to aggregate multiple rows into one then use GROUP BY and aggregation functions:

    SELECT client_group,
           MIN(age),
           MAX(age),
           AVG(age),
           VARIANCE(age),
           MEDIAN(age),
           MIN(weight),
           MAX(weight),
           AVG(weight),
           VARIANCE(weight),
           MEDIAN(weight),
           MIN(height),
           MAX(height),
           AVG(height),
           VARIANCE(height),
           MEDIAN(height)
    FROM   aux
    GROUP BY
           client_group;
    
    CLIENT_GROUP MIN(AGE) MAX(AGE) AVG(AGE) VARIANCE(AGE) MEDIAN(AGE) MIN(WEIGHT) MAX(WEIGHT) AVG(WEIGHT) VARIANCE(WEIGHT) MEDIAN(WEIGHT) MIN(HEIGHT) MAX(HEIGHT) AVG(HEIGHT) VARIANCE(HEIGHT) MEDIAN(HEIGHT)
    1 12 87 45.33333333333333333333333333333333333333 877.46666666666666666666666666666666666 48.5 5.825719181 196.9548512 82.5287195785 5792.1105324904466527375 63.045361985 44.23760904 90.16203412 62.90592673 278.3366481669938008 59.48004826
    2 10 90 34.16666666666666666666666666666666666667 924.166666666666666666666666666666666667 25 24.42941936 188.3958287 127.794781355 5683.60617887341174847 164.4701555 12.30569768 97.83613127 67.842827805 1078.08620430126785475 74.453734735
    3 9 95 56.5 1612.3 66 60.42511642 196.673485 129.479753125 3267.09420872211803671 124.23409945 15.29331011 94.60058165 70.14036342 789.97206409484731992 74.886196615

    There are no redundant statements in that query as each aggregation function calculates a different metric for a different input column and that is about as succinct as you can get for a single row-per-client_group.

    Another alternative would be to have one row-per-variable-per-client_group:

    SELECT client_group,
           type,
           MIN(value),
           MAX(value),
           AVG(value),
           VARIANCE(value),
           MEDIAN(value)
    FROM   aux
    UNPIVOT(value FOR type IN (age, weight, height))
    GROUP BY
           client_group,
           type;
    

    Which outputs:

    CLIENT_GROUP TYPE MIN(VALUE) MAX(VALUE) AVG(VALUE) VARIANCE(VALUE) MEDIAN(VALUE)
    1 AGE 12 87 45.33333333333333333333333333333333333333 877.46666666666666666666666666666666666 48.5
    1 HEIGHT 44.23760904 90.16203412 62.90592673 278.3366481669938008 59.48004826
    1 WEIGHT 5.825719181 196.9548512 82.5287195785 5792.1105324904466527375 63.045361985
    2 AGE 10 90 34.16666666666666666666666666666666666667 924.166666666666666666666666666666666667 25
    2 HEIGHT 12.30569768 97.83613127 67.842827805 1078.08620430126785475 74.453734735
    2 WEIGHT 24.42941936 188.3958287 127.794781355 5683.60617887341174847 164.4701555
    3 AGE 9 95 56.5 1612.3 66
    3 HEIGHT 15.29331011 94.60058165 70.14036342 789.97206409484731992 74.886196615
    3 WEIGHT 60.42511642 196.673485 129.479753125 3267.09420872211803671 124.23409945

    You get exactly the same data just split into multiple rows for each metric.

    There is no point using the second method and then re-pivoting back to having multiple columns on a single row; that is just doing lots of extra work for no benefit when compared to the first method.

    The explain plan for the first query is:

    Plan hash value: 1572259232
     
    ---------------------------------------------------------------------------
    | Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    ---------------------------------------------------------------------------
    |   0 | SELECT STATEMENT   |      |     3 |    63 |     4  (25)| 00:00:01 |
    |   1 |  SORT GROUP BY     |      |     3 |    63 |     4  (25)| 00:00:01 |
    |   2 |   TABLE ACCESS FULL| AUX  |    18 |   378 |     3   (0)| 00:00:01 |
    ---------------------------------------------------------------------------
    

    and compare than to the explain plan for the query:

    EXPLAIN PLAN FOR
    SELECT *
    FROM   (
      SELECT client_group,
             type,
             MIN(value) AS min_value,
             MAX(value) AS max_value,
             AVG(value) AS avg_value,
             VARIANCE(value) AS var_value,
             MEDIAN(value) AS med_value
      FROM   aux
      UNPIVOT INCLUDE NULLS (value FOR type IN (age, weight, height))
      GROUP BY
             client_group,
             type
    )
    PIVOT (
      MAX(min_value) AS min,
      MAX(max_value) AS max,
      MAX(avg_value) AS avg,
      MAX(var_value) AS var,
      MAX(med_value) AS med
      FOR type IN (
        'AGE' AS age,
        'WEIGHT' AS weight,
        'HEIGHT' AS height
      )
    );
    

    is:

    Plan hash value: 4041832950
     
    -----------------------------------------------------------------------------------
    | Id  | Operation                  | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT           |      |    54 |  4482 |     6  (17)| 00:00:01 |
    |   1 |  SORT GROUP BY NOSORT PIVOT|      |    54 |  4482 |     6  (17)| 00:00:01 |
    |   2 |   VIEW                     |      |    54 |  4482 |     6  (17)| 00:00:01 |
    |   3 |    SORT GROUP BY           |      |    54 |  1674 |     6  (17)| 00:00:01 |
    |   4 |     VIEW                   |      |    54 |  1674 |     5   (0)| 00:00:01 |
    |   5 |      UNPIVOT               |      |       |       |            |          |
    |   6 |       TABLE ACCESS FULL    | AUX  |    18 |   378 |     3   (0)| 00:00:01 |
    -----------------------------------------------------------------------------------
    

    then you can see how much more work is being done just so you can type a few fewer characters. Just use first method and GROUP BY and either copy-paste the aggregation functions and edit the column name you are aggregating by or write a small SQL script to generate the aggregation query from the data dictionary.

    SELECT 'SELECT client_group'
           || LISTAGG(
                   ', MAX(' || column_name || ')'
                || ', MIN(' || column_name || ')'
                || ', AVG(' || column_name || ')'
                || ', VARIANCE(' || column_name || ')'
                || ', MEDIAN(' || column_name || ')'
              ) WiTHIN GROUP (ORDER BY column_id)
           || ' FROM aux GROUP BY client_group' AS sql
    FROM   user_tab_columns
    WHERE  table_name = 'AUX'
    AND    column_name <> 'CLIENT_GROUP';
    

    Outputs:

    SQL
    SELECT client_group, MAX(AGE), MIN(AGE), AVG(AGE), VARIANCE(AGE), MEDIAN(AGE), MAX(HEIGHT), MIN(HEIGHT), AVG(HEIGHT), VARIANCE(HEIGHT), MEDIAN(HEIGHT), MAX(WEIGHT), MIN(WEIGHT), AVG(WEIGHT), VARIANCE(WEIGHT), MEDIAN(WEIGHT) FROM aux GROUP BY client_group

    (However, you may run into issues if your generated SQL statement is going to be longer than 4000 characters and need to aggregate into a CLOB - but there are ways around that.)

    fiddle