Search code examples
postgresqlgroupingsampling

Taking N-samples from each group in PostgreSQL


I have a table containing data that has a column named id that looks like below:

id value 1 value 2 value 3
1 244 550 1000
1 251 551 700
1 540 60 1200
... ... ... ...
2 19 744 2000
2 10 903 100
2 44 231 600
2 120 910 1100
... ... ... ...

I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points.

For example I would like a maximum 50 data points randomly selected from id = 1, id = 2 etc...

I cannot find any previous questions similar to this but have tried taking a stab at at least logically working through the solution where I could iterate and union all queries by id and limit to 50:

SELECT * FROM (SELECT * FROM schema.table AS tbl WHERE tbl.id = X LIMIT 50) UNION ALL;

But it's obvious that you cannot use this type of solution because UNION ALL requires aggregating outputs from one id to the next and I do not have a list of id values to use in place of X in tbl.id = X.

Is there a way to accomplish this by gathering that list of unique id values and union all results or is there a more optimal way this could be done?


Solution

  • In case you are looking to get 50 (or less) from each group of IDs then you can use windowing -

    From question - "I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points."

    Query -

    with data as (
    select row_number() over (partition by id order by random()) rn,
    * from table_name)
    select * from data where rn<=50 order by id;
    

    Fiddle.