Search code examples
rdata.tablesample

R Sample By Minimum Cell Size


set.seed(1)
data=data.frame(SCHOOL = rep(1:10, each = 1000), GRADE = sample(7:12, r = T, size = 10000),SCORE = sample(1:100, r = T, size = 10000))

I have 'data' that contains information about student test score. I wish to: count how many GRADE for each SCHOOL, and then take the smallest value of GRADE for all SCHOOLS. Like this:

For each SCHOOL count the number of rows for a specific GRADE. Then for each GRADE find the smallest values across all SCHOOLs. Finally I wish to take a random sample based on the smallest value found in step 2.

So basically in this basic example with two SCHOOLs and GRADE 7 and GRADE 8: enter image description here

SCHOOL 1 has 2 SCOREs for GRADE 7 and SCHOOL 1 has 3 SCOREs for GRADE 8.

SCHOOL 2 has 1 SCOREs for GRADE 7 and SCHOOL 2 has 4 SCOREs for GRADE 8.

So the new data contains one SCORE for GRADE 7 from SCHOOL 1 and SCHOOL 2, and three SCORE for GRADE 8 from SCHOOL 1 and SCHOOL 2 and these SCORE that are picked are RANDOMLY SAMPLED.

like this:

enter image description here

My attempt: data[, .SD[sample(x = .N, size = min(sum(GRADE), .N))], by = .(SCHOOL,GRADE]


Solution

  • This follows your description of how to do it step-by-step.

    library(data.table)
    setDT(data)
    data[, N := .N, .(SCHOOL, GRADE)]
    data[, N := min(N), GRADE]
    data[, .(SCORE = sample(SCORE, N)), .(SCHOOL, GRADE, N)][, -'N']
    

    If you have multiple SCORE-like columns and you want keep the same rows from each then you can use .SD like in your attempt:

    data[, .SD[sample(.N, N)], .(SCHOOL, GRADE, N)][, -'N']