I am trying to figure out how to randomly sample IDs matching a group variable, but specifically randomly sampling by weighting each grouping variable. For instance, take this data:
I want to weight Group A at 60%, Group B at 20%, Group C at 15%, and Group D at 5%, and then randomly sample 10 IDs, without duplicates, according to those weights, returning those IDs in another column.
Hopefully this makes sense and thank you for any help in advance!
I have figured out random sampling generally but am struggling with the weighting and not sampling duplicates.
You could use the following:
=LET(a,{"A";"B";"C";"D"},
b,{60;20;15;5},
c,TOCOL(IFS(SEQUENCE(,SUM(b))<=LOOKUP(B2:B25,HSTACK(a,b)),A2:A25),2),
TAKE(UNIQUE(SORTBY(c,RANDARRAY(ROWS(c)))),10))
It first declares a
to be your groups and it's weight b
.
c
looks up the values in column B in array a
and returns it's weight b
. If the returned value is smaller than or equal to the sequence of the sum of b
it returns the value of column A, else it returns an error. This results in the values of column A being repeated the number of times mentioned in b
where the value in column B matches the value of a
.
TOCOL removes the errors and makes all values become a one column array.
This is sorted randomly and the first 10 values of the uniques are taken.