Search code examples
oracle-databasesamplingstatistics-bootstrap

Sampling with Replacement/Bootstrap in Oracle SQL


This is my first question to the stack overflow.

We are doing statistics for simulation data and I wanted to implement the bootstrap in Oracle-SQL (sampling data with replacement). My approach is as follows:

  1. Number all rows in the query
  2. Create a list of random numbers within the row range
  3. For each random number, join a row. -> i.e. sampling with replacement

However, there seems to be a problem with joins. Oracle is losing some rows.

Here is an example. When I do the following query, I get a random ordering of the rows:

    WITH basis as (
         SELECT 'A' AS KPI_KEY, 2 AS KPI_VALUE FROM dual
        union all
        SELECT 'A' AS KPI_KEY, 5 AS KPI_VALUE FROM dual
        union all
        SELECT 'A' AS KPI_KEY, 3 AS KPI_VALUE FROM dual
    ), group_counts AS (
      SELECT KPI_KEY, COUNT(*) as total_count
      FROM basis
      GROUP BY KPI_KEY
    )
    , random_numbers AS (
      SELECT KPI_KEY, ceil(dbms_random.value(0, max(total_count))) AS rand_rn
      FROM group_counts
      GROUP BY KPI_KEY
      CONNECT BY LEVEL <= total_count AND PRIOR KPI_KEY = KPI_KEY AND PRIOR dbms_random.value IS NOT NULL
    )
    select * from random_numbers
    ;

Result (correct): KPI_KEY RAND_RN 'A' 1 'A' 2 'A' 2

Now, when I join the numbered rows like the following, the result is wrong:

    WITH basis as (
         SELECT 'A' AS KPI_KEY, 2 AS KPI_VALUE FROM dual
        union all
        SELECT 'A' AS KPI_KEY, 5 AS KPI_VALUE FROM dual
        union all
        SELECT 'A' AS KPI_KEY, 3 AS KPI_VALUE FROM dual
    ), group_counts AS (
      SELECT KPI_KEY, COUNT(*) as total_count
      FROM basis
      GROUP BY KPI_KEY
    )
    , numbered_rows AS (
      SELECT KPI_KEY, KPI_VALUE, ROW_NUMBER() 
        OVER (PARTITION BY KPI_KEY ORDER BY dbms_random.value) AS rn, total_count
      FROM basis
      JOIN group_counts USING (KPI_KEY)
    )
    , random_numbers AS (
      SELECT KPI_KEY, ceil(dbms_random.value(0, max(total_count))) AS rand_rn
      FROM group_counts
      GROUP BY KPI_KEY
      CONNECT BY LEVEL <= total_count AND PRIOR KPI_KEY = KPI_KEY AND PRIOR dbms_random.value IS NOT NULL
    )
    SELECT rn.KPI_KEY, nr.KPI_VALUE, nr.rn
    FROM random_numbers rn
    LEFT JOIN numbered_rows nr 
        ON rn.KPI_KEY = nr.KPI_KEY 
        and rn.rand_rn = nr.rn
    ;

Result (value is random, but missing rows from join): KPI_KEY KPI_VALUE RN 'A' 5 3

Someone who has had a similar problem before? Seems like an Oracle bug to me, but maybe I missed an important detail.

I tried the materialized hint, and also the rownum approach found here: Random join in oracle


Solution

  • The issue is that the GROUP BY of the penultimate query is aggregating all the generated rows into a single row and the CONNECT BY clause effectively does unnecessary work for zero benefit.

    This appears to be because the SQL engine in rewriting the query and if you take the output from the penultimate query, without the ultimate query, then it performs GROUP BY then CONNECT BY and if you perform both the penultimate and ultimate queries then it performs CONNECT BY and then GROUP BY (in the reverse order).

    EXPLAIN PLAN FOR
      WITH basis as (
             SELECT 'A' AS KPI_KEY, 2 AS KPI_VALUE FROM dual
            union all
            SELECT 'A' AS KPI_KEY, 5 AS KPI_VALUE FROM dual
            union all
            SELECT 'A' AS KPI_KEY, 3 AS KPI_VALUE FROM dual
        ), group_counts AS (
          SELECT KPI_KEY, COUNT(*) as total_count
          FROM basis
          GROUP BY KPI_KEY
        )
        , numbered_rows AS (
          SELECT KPI_KEY, KPI_VALUE, ROW_NUMBER() 
            OVER (PARTITION BY KPI_KEY ORDER BY dbms_random.value) AS rn, total_count
          FROM basis
          JOIN group_counts USING (KPI_KEY)
        )
        , random_numbers AS (
          SELECT KPI_KEY, ceil(dbms_random.value(0, max(total_count))) AS rand_rn
          FROM group_counts
          GROUP BY KPI_KEY
          CONNECT BY LEVEL <= total_count AND PRIOR KPI_KEY = KPI_KEY AND PRIOR dbms_random.value IS NOT NULL
        )
        SELECT *
        FROM random_numbers rn
        ;
    

    Gives the plan:

    PLAN_TABLE_OUTPUT
    Plan hash value: 448364244
     
    --------------------------------------------------------------------------------------
    | Id  | Operation                     | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT              |      |     3 |    48 |     7  (15)| 00:00:01 |
    |   1 |  VIEW                         |      |     3 |    48 |     7  (15)| 00:00:01 |
    |*  2 |   CONNECT BY WITHOUT FILTERING|      |       |       |            |          |
    |   3 |    VIEW                       |      |     3 |    48 |     7  (15)| 00:00:01 |
    |   4 |     HASH GROUP BY             |      |     3 |     9 |     7  (15)| 00:00:01 |
    |   5 |      VIEW                     |      |     3 |     9 |     6   (0)| 00:00:01 |
    |   6 |       UNION-ALL               |      |       |       |            |          |
    |   7 |        FAST DUAL              |      |     1 |       |     2   (0)| 00:00:01 |
    |   8 |        FAST DUAL              |      |     1 |       |     2   (0)| 00:00:01 |
    |   9 |        FAST DUAL              |      |     1 |       |     2   (0)| 00:00:01 |
    --------------------------------------------------------------------------------------
     
    Predicate Information (identified by operation id):
    ---------------------------------------------------
     
       2 - access("KPI_KEY"=PRIOR "KPI_KEY")
           filter("TOTAL_COUNT">=LEVEL AND PRIOR "DBMS_RANDOM"."VALUE"() IS NOT
                  NULL)
    

    And:

    EXPLAIN PLAN FOR
    WITH basis as (
             SELECT 'A' AS KPI_KEY, 2 AS KPI_VALUE FROM dual
            union all
            SELECT 'A' AS KPI_KEY, 5 AS KPI_VALUE FROM dual
            union all
            SELECT 'A' AS KPI_KEY, 3 AS KPI_VALUE FROM dual
        ), group_counts AS (
          SELECT KPI_KEY, COUNT(*) as total_count
          FROM basis
          GROUP BY KPI_KEY
        )
        , numbered_rows AS (
          SELECT KPI_KEY, KPI_VALUE, ROW_NUMBER() 
            OVER (PARTITION BY KPI_KEY ORDER BY dbms_random.value) AS rn, total_count
          FROM basis
          JOIN group_counts USING (KPI_KEY)
        )
        , random_numbers AS (
          SELECT KPI_KEY, ceil(dbms_random.value(0, max(total_count))) AS rand_rn
          FROM group_counts
          GROUP BY KPI_KEY
          CONNECT BY LEVEL <= total_count AND PRIOR KPI_KEY = KPI_KEY AND PRIOR dbms_random.value IS NOT NULL
        )
        SELECT rn.*, nr.*, ROWNUM
        FROM random_numbers rn
        FULL OUTER JOIN numbered_rows nr 
            ON rn.KPI_KEY = nr.KPI_KEY 
            and rn.rand_rn = nr.rn;
    

    Gives the plan:

    Plan hash value: 1952896670
     
    -----------------------------------------------------------------------------------------------------------------------
    | Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                         |                            |     2 |    96 |    17  (18)| 00:00:01 |
    |   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
    |   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D698B_58974E7 |       |       |            |          |
    |   3 |    UNION-ALL                             |                            |       |       |            |          |
    |   4 |     FAST DUAL                            |                            |     1 |       |     2   (0)| 00:00:01 |
    |   5 |     FAST DUAL                            |                            |     1 |       |     2   (0)| 00:00:01 |
    |   6 |     FAST DUAL                            |                            |     1 |       |     2   (0)| 00:00:01 |
    |   7 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D698C_58974E7 |       |       |            |          |
    |   8 |    HASH GROUP BY                         |                            |     1 |     3 |     3  (34)| 00:00:01 |
    |   9 |     VIEW                                 |                            |     3 |     9 |     2   (0)| 00:00:01 |
    |  10 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D698B_58974E7 |     3 |    48 |     2   (0)| 00:00:01 |
    |  11 |   COUNT                                  |                            |       |       |            |          |
    |  12 |    VIEW                                  | VW_FOJ_0                   |     2 |    96 |     8  (25)| 00:00:01 |
    |* 13 |     HASH JOIN FULL OUTER                 |                            |     2 |    96 |     8  (25)| 00:00:01 |
    |  14 |      VIEW                                |                            |     1 |    16 |     3  (34)| 00:00:01 |
    |  15 |       HASH GROUP BY                      |                            |     1 |    16 |     3  (34)| 00:00:01 |
    |* 16 |        CONNECT BY WITHOUT FILTERING      |                            |       |       |            |          |
    |  17 |         VIEW                             |                            |     3 |    48 |     2   (0)| 00:00:01 |
    |  18 |          TABLE ACCESS FULL               | SYS_TEMP_0FD9D698C_58974E7 |     3 |     9 |     2   (0)| 00:00:01 |
    |  19 |      VIEW                                |                            |     1 |    32 |     5  (20)| 00:00:01 |
    |  20 |       WINDOW SORT                        |                            |     1 |    22 |     5  (20)| 00:00:01 |
    |* 21 |        HASH JOIN                         |                            |     1 |    22 |     4   (0)| 00:00:01 |
    |  22 |         VIEW                             |                            |     3 |    18 |     2   (0)| 00:00:01 |
    |  23 |          TABLE ACCESS FULL               | SYS_TEMP_0FD9D698B_58974E7 |     3 |    48 |     2   (0)| 00:00:01 |
    |  24 |         VIEW                             |                            |     3 |    48 |     2   (0)| 00:00:01 |
    |  25 |          TABLE ACCESS FULL               | SYS_TEMP_0FD9D698C_58974E7 |     3 |     9 |     2   (0)| 00:00:01 |
    -----------------------------------------------------------------------------------------------------------------------
     
    Predicate Information (identified by operation id):
    ---------------------------------------------------
     
      13 - access("RN"."KPI_KEY"="NR"."KPI_KEY" AND "RN"."RAND_RN"="NR"."RN")
      16 - access("KPI_KEY"=PRIOR "KPI_KEY")
           filter("TOTAL_COUNT">=LEVEL AND PRIOR "DBMS_RANDOM"."VALUE"() IS NOT NULL)
      21 - access("BASIS"."KPI_KEY"="GROUP_COUNTS"."KPI_KEY")
    

    It may be possible to fix the query and materialise the output from the penultimate query so that the EXPLAIN PLAN is consistently generated with the ultimate part of your query; however, this seems to be a fragile solution and you would probably be better to rewrite the query entirely to use something that is (a) simpler and (b) less impacted by the compiler rewriting the query.


    Assuming that your logic is:

    1. For each kpi_key:

      1. Assign each row a unique row number.
      2. Assign each row a random row number from 1 up to the total number of rows for that kpi_key (there is a bug in your logic using CEIL(DBMS_RANDOM.VALUE(0, num_rows)) as DBMS_RANDOM will include the lower-bound and exclude the upper-bound so you can get values starting from 0, which you do not want to include, and rounding up to num_rows but the probabilities are not even as the upper-bound is excluded; instead you want to use FLOOR and add 1).
    2. Self-join the rows so that each row is joined to its random counter-part.

    Then you can rewrite the query using analytic functions:

    WITH basis (kpi_key, kpi_value) as (
      SELECT 'A', 2 FROM dual union all
      SELECT 'A', 5 FROM dual union all
      SELECT 'A', 3 FROM dual
    ),
    random_numbers AS (
      SELECT kpi_key,
             kpi_value,
             1 + FLOOR(DBMS_RANDOM.VALUE(0, COUNT(*) OVER (PARTITION BY kpi_key))) AS rand_rn,
             ROW_NUMBER() OVER (PARTITION BY kpi_key ORDER BY ROWNUM) AS rn
      FROM   basis
    )
    select original.kpi_key   AS o_kpi_key,
           original.kpi_value AS o_kpi_value,
           original.rn        AS o_rn,
           rand.kpi_key       AS r_kpi_key,
           rand.kpi_value     AS r_kpi_value,
           rand.rn            AS r_rn
    from   random_numbers original
           INNER JOIN random_numbers rand
           ON original.rand_rn = rand.rn;
    

    Note: There is no point randomly ordering the ROW_NUMBER output as you are comparing it to a random value and you will get the same randomness comparing a static-ordered list to random values within that list as you would when comparing a randomly-ordered list to random values within that list.

    Which may randomly output:

    O_KPI_KEY O_KPI_VALUE O_RN R_KPI_KEY R_KPI_VALUE R_RN
    A 5 2 A 3 3
    A 2 1 A 3 3
    A 3 3 A 2 1

    Note: You probably just want the last 3 columns corresponding to the random row selection.

    fiddle