Search code examples
postgresqlrandomreplacesampling

Sampling in PostgreSQL


I am looking for possible ways of random sampling in PostgreSQL. I found a couple of methods to do that with different advantages and disadvantages. The naive way to do that is:

select * from Table_Name
order by random()
limit 10;

Another faster method is:

select * from Table_Name
WHERE random() <= 0.01
order by random()
limit 10;

(Although that 0.01 depends on the table size and the sample size; this is just an example.)

In both of these queries a random number is generated for each row and sorted based on those random generated numbers. Then in the sorted numbers the first 10 are selected as the final result, so I think these should be sampling without replacement.

Now what I want to do is to somehow turn this sampling methods into sampling with replacement. How is that possible? Or is there any other random sampling method with replacement in PostgreSQL?

I have to say that I do have an idea how this might be possible but I don't know how to implement it in postgresql, Here is my idea:
If instead of generating one random value we generate S random values where S is the sample size,then order all of the random generated values,it will be sampling with replacement.(I don't know if I am right)
At this point I don't mind about the performance of the query.


Solution

  • This could be achieved by mapping the random values to row numbers. The same row could be sampled N times if it happens that the same corresponding random number comes out N times. Here's a CTE implementation:

    with 
     rows as (select *,row_number() over() as rn from tablename order by random()),
     w(num) as (select (random()*(select count(*) from rows))::int+1
                  from generate_series(1,10)) 
    select rows.* from rows join w on rows.rn = w.num;