Search code examples
sqloracle-databasesampledata-generationpopulation

Random data sampling with oracle sql, data generation


i need to generate some sample data from a population. I want to do this with an SQL query on an Oracle 11g database.

Here is a simple working example with population size 4 and sample size 2:

with population as (
    select 1 as val from dual union all 
    select 2 from dual union all 
    select 3 from dual union all 
    select 4 from dual)

select val from (
    select val, dbms_random.value(0,10) AS RANDORDER 
    from  population 
    order by randorder) 
where rownum <= 2

(the oracle sample() funtion didn't work in connection with the WITH-clause for me)

But now I, I want to "upscale" or multiply my sample data. So that I can get something like 150 % sample data of the population data (population size 4 and sample size 6, e.g.)

Is there a good way to achieve this with an SQL query?


Solution

  • The solution depends, if you want all rows from first initial set(s) and random additional rows from last one then use:

    with params(size_, sample_) as (select 4, 6 from dual)
    select val 
      from (
        select mod(level - 1, size_) + 1 val, sample_,
               case when level <= size_ * floor(sample_ / size_) then 0 
                    else dbms_random.value() 
               end rand
          from params
          connect by level <= size_ * ceil(sample_ / size_)
          order by rand)
      where rownum <= sample_
    

    But if you allow possibility of result like (1, 1, 2, 2, 3, 3), where some values may not appear at all in output (here 4) then use this:

    with params(size_, sample_) as (select 4, 6 from dual)
    select val 
      from (
        select mod(level - 1, size_) + 1 val, sample_, dbms_random.value() rand
          from params
          connect by level <= size_ * ceil(sample_ / size_)
          order by rand)
      where rownum <= sample_
    

    How it works? We build set of (1, 2, 3, 4) as many times as it results from division sample / size. Then we assign random values. In first case I assign 0 to first set(s), so they will be in output for sure, and random values to last set. In second case randoms are assigned to all rows.