I am looking for possible ways of doing random sampling in PostgreSQL. I found a couple of methods, each with different advantages and disadvantages. The naive way is:
select * from Table_Name
order by random()
limit 10;
Another faster method is:
select * from Table_Name
WHERE random() <= 0.01
order by random()
limit 10;
(Although that 0.01 depends on the table size and the sample size; this is just an example.)
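For example, one rough way to pick that fraction is just to oversample relative to the table size; the factor of 100 here is an arbitrary safety margin I chose so the filter is unlikely to leave fewer than 10 candidate rows:
select least(1.0, 100 * 10.0 / count(*)) as fraction  -- capped at 1.0 for small tables
from Table_Name;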
In both of these queries, a random number is generated for each row, and the rows are sorted by those randomly generated numbers. Then the first 10 rows of the sorted result are selected as the final sample, so I think both of these are sampling without replacement.
Now what I want to do is to somehow turn these sampling methods into sampling with replacement. How is that possible? Or is there another random sampling method with replacement in PostgreSQL?
I should say that I do have an idea of how this might be possible, but I don't know how to implement it in PostgreSQL. Here is my idea:
If, instead of generating one random value, we generate S random values, where S is the sample size, and then order all of the randomly generated values, it will be sampling with replacement. (I don't know if I am right.)
At this point I am not concerned about the performance of the query.
This could be achieved by mapping the random values to row numbers. The same row can be sampled N times if the corresponding random number comes up N times. Here's a CTE implementation:
with
  -- number every row; which row gets which rn does not matter
  numbered as (select *, row_number() over () as rn from tablename),
  -- draw 10 random row numbers in 1..count, duplicates allowed;
  -- floor() keeps the pick uniform (a bare ::int cast would round)
  w(num) as (select floor(random() * (select count(*) from numbered))::int + 1
             from generate_series(1, 10))
select numbered.* from numbered join w on numbered.rn = w.num;
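As a quick sanity check (same hypothetical tablename, just drawing 1,000 samples instead of 10), you can count how often each row number comes back; duplicates in the output are expected whenever the table is smaller than the sample, which confirms the rows are drawn with replacement:
with numbered as (select *, row_number() over () as rn from tablename),
     w(num) as (select floor(random() * (select count(*) from numbered))::int + 1
                from generate_series(1, 1000))
select numbered.rn, count(*) as times_sampled
from numbered join w on numbered.rn = w.num
group by numbered.rn
order by times_sampled desc;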