I have data with the following structure:
CREATE TABLE if not EXISTS scores (
id int,
class char,
score float
);
INSERT INTO scores VALUES
(1, 'A', 0.5),
(1, 'B', 0.2),
(1, 'C', 0.1),
(2, 'A', 0.1),
(2, 'B', 0.2),
(3, 'D', 0.01),
(4, 'A', 0.5),
(4, 'B', 0.5);
I want to randomly sample, for each id, a class. A possible sample obtained would be:
1,'A'
2,'B'
3,'D'
4,'A'
The logic I want to use for the sample is the following: each class is sampled proportionally to its score. For instance:
id = 1
, sampling class 'B' should be twice as likely to be sampled than class 'C'.id = 2
, sampling class 'B' should be twice as likely to be sampled than class 'A'.id = 3
, we should only sample class 'D'.id = 4
, sampling class 'B' should be as likely as sampling class 'A'.I'm looking for ways of doing this in BigQuery/PostgreSQL. Also, are there solutions with a fixed random seed that can be reproduced?
Thanks!
Possible approach is to generate for each 'id-class' pair a number of rows equivalent to the score (50 '1-A' rows, 20 '1-B' rows, 10 '1-C' rows, etc...) and then select 1 row randomly per id.
For BigQuery:
select id, array_agg(class order by rand() limit 1)[offset(0)]
from scores, unnest(generate_array(1, score * 100))
group by id