I have a table containing data that has a column named id
that looks like below:
id | value 1 | value 2 | value 3 |
---|---|---|---|
1 | 244 | 550 | 1000 |
1 | 251 | 551 | 700 |
1 | 540 | 60 | 1200 |
... | ... | ... | ... |
2 | 19 | 744 | 2000 |
2 | 10 | 903 | 100 |
2 | 44 | 231 | 600 |
2 | 120 | 910 | 1100 |
... | ... | ... | ... |
I want to take 50 sample rows per id
that exists but if less than 50 exist for the group to simply take the entire set of data points.
For example I would like a maximum 50 data points randomly selected from id = 1
, id = 2
etc...
I cannot find any previous questions similar to this but have tried taking a stab at at least logically working through the solution where I could iterate and union all queries by id
and limit to 50:
SELECT * FROM (SELECT * FROM schema.table AS tbl WHERE tbl.id = X LIMIT 50) UNION ALL;
But it's obvious that you cannot use this type of solution because UNION ALL
requires aggregating outputs from one id to the next and I do not have a list of id
values to use in place of X in tbl.id = X
.
Is there a way to accomplish this by gathering that list of unique id
values and union all results or is there a more optimal way this could be done?
In case you are looking to get 50 (or less) from each group of IDs then you can use windowing -
From question - "I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points."
Query -
with data as (
select row_number() over (partition by id order by random()) rn,
* from table_name)
select * from data where rn<=50 order by id;