set.seed(1)
data=data.frame(SCHOOL = rep(1:10, each = 1000), GRADE = sample(7:12, r = T, size = 10000),SCORE = sample(1:100, r = T, size = 10000))
I have 'data' that contains information about student test score. I wish to: count how many GRADE for each SCHOOL, and then take the smallest value of GRADE for all SCHOOLS. Like this:
For each SCHOOL count the number of rows for a specific GRADE. Then for each GRADE find the smallest values across all SCHOOLs. Finally I wish to take a random sample based on the smallest value found in step 2.
So basically in this basic example with two SCHOOLs and GRADE 7 and GRADE 8:
SCHOOL 1 has 2 SCOREs for GRADE 7 and SCHOOL 1 has 3 SCOREs for GRADE 8.
SCHOOL 2 has 1 SCOREs for GRADE 7 and SCHOOL 2 has 4 SCOREs for GRADE 8.
So the new data contains one SCORE for GRADE 7 from SCHOOL 1 and SCHOOL 2, and three SCORE for GRADE 8 from SCHOOL 1 and SCHOOL 2 and these SCORE that are picked are RANDOMLY SAMPLED.
like this:
My attempt: data[, .SD[sample(x = .N, size = min(sum(GRADE), .N))], by = .(SCHOOL,GRADE]
This follows your description of how to do it step-by-step.
library(data.table)
setDT(data)
data[, N := .N, .(SCHOOL, GRADE)]
data[, N := min(N), GRADE]
data[, .(SCORE = sample(SCORE, N)), .(SCHOOL, GRADE, N)][, -'N']
If you have multiple SCORE
-like columns and you want keep the same rows from each then you can use .SD
like in your attempt:
data[, .SD[sample(.N, N)], .(SCHOOL, GRADE, N)][, -'N']