I want to randomly sample some subset of my data, using a random seed so that it will be repeatable. Currently, I have this working minus the seed:
select * from my_table
where version is not null
and start_datetime::date >= date('2020-03-16')
and start_datetime::date < date('2020-05-15')
order by random()
limit 10000
Now I'd like to set a random seed so I can reliably get the same results from this query.
Is there a good way to do this?
One option uses random function setseed()
. As explained in the documentation: once this function is called, the results of subsequent random()
calls in the current session are repeatable by re-issuing setseed()
with the same argument.
The technique is to include the call to the function directly in your query, using union all
, and then sort in the outer query. This requires listing the columns that you want to return from the query. Assuming that you want columns col1, col2, col3
, your query would be:
select *
from (
select setseed(0.5) colx, null col1, null col2, null col3
union all
select null, col1, col2, col3
from mytable
where
rpt.assist_detail.version is not null
and start_datetime::date >= date '2020-03-16'
and start_datetime::date < date '2020-05-15'
offset 1
) t
order by random()
limit 10000
offset 1
is there to drop the row generated by the first subquery. The argument of setseed()
(here 0.5
) may be any arbitrary value between -1
and 1
. As long as you pass the same value, you get the same sort.