Search code examples
sqlpostgresqlrandomsql-order-bysql-limit

Repeatable Pseudorandom Sample of rows in PostgresSQL


I want to randomly sample some subset of my data, using a random seed so that it will be repeatable. Currently, I have this working minus the seed:

select * from my_table
where version is not null
  and start_datetime::date >= date('2020-03-16')
  and start_datetime::date < date('2020-05-15')
order by random()
limit 10000

Now I'd like to set a random seed so I can reliably get the same results from this query.

Is there a good way to do this?


Solution

  • One option uses random function setseed(). As explained in the documentation: once this function is called, the results of subsequent random() calls in the current session are repeatable by re-issuing setseed() with the same argument.

    The technique is to include the call to the function directly in your query, using union all, and then sort in the outer query. This requires listing the columns that you want to return from the query. Assuming that you want columns col1, col2, col3, your query would be:

    select * 
    from (
        select setseed(0.5) colx, null col1, null col2, null col3
        union all
        select null, col1, col2, col3
        from mytable
        where 
            rpt.assist_detail.version is not null
            and start_datetime::date >= date '2020-03-16'
            and start_datetime::date <  date '2020-05-15'
        offset 1
    ) t
    order by random()
    limit 10000
    

    offset 1 is there to drop the row generated by the first subquery. The argument of setseed() (here 0.5) may be any arbitrary value between -1 and 1. As long as you pass the same value, you get the same sort.