Tags: r, conditional-statements, sampling

Sample a specific number of tibble rows conditional on a variable/column reaching a certain mean value


I have a dataset with many rows (~500000). A column "X" of this dataset has a mean value of 4.5. I would like to sample the dataset (without replacement) to get approximately 50000 rows whose mean value of "X" is approximately 3.5.

How would I do that in R in a way that is reasonably fast?


Solution

  • Since the OP's only criterion is for the sample mean to be close to 3.5, with no constraint on dispersion, here is a possible approach:

    1. calculate the absolute deviation of X from 3.5,
    2. sort the data by this absolute deviation,
    3. calculate the cumulative mean of X in that sorted order,
    4. keep only the rows whose cumulative mean is within a small tolerance of 3.5, then sample from those rows.

    Code:

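    library(data.table)
    nr <- 5e5   # number of rows in the full dataset
    ns <- 5e4   # number of rows to sample
    DT <- data.table(X=rnorm(nr, 4.5))   # simulated data with mean(X) near 4.5
    
    target <- 3.5
    dev <- 0.05   # tolerance around the target for the cumulative mean
    # add the absolute deviation from the target and sort by it, in place
    setorder(DT[, absDev := abs(X - target)], absDev)
    # running mean of X, starting from the values closest to the target
    DT[, cummean := cumsum(X) / seq_len(.N)]
    # sample ns values (without replacement) from rows whose running mean is in range;
    # sample() throws an error if fewer than ns rows qualify
    x <- DT[(target-dev) <= cummean & cummean <= (target+dev), sample(X, ns)]
    mean(x)
    #[1] 3.549371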
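
The code above returns only the sampled values of X, whereas the question asks for rows of the dataset. As a minimal sketch under that reading, the same idea can be wrapped in a helper that returns whole rows. The name `sample_near_mean`, its arguments, the `id` column, and the use of base R instead of data.table are assumptions made for illustration, not part of the original answer. It also takes the longest prefix of the sorted data whose cumulative mean lies within the tolerance, which is a slight variation on the filter used above, and it assumes the column contains no NA values.

```r
# Hypothetical helper (illustrative, not from the original answer):
# return n rows of df, sampled without replacement, such that the mean of
# df[[col]] in the sampling pool is within tol of target.
sample_near_mean <- function(df, col, target, n, tol = 0.05) {
  x <- df[[col]]
  # row indices ordered from closest to farthest from the target
  ord <- order(abs(x - target))
  # running mean of x in that order
  cummean <- cumsum(x[ord]) / seq_along(ord)
  # positions at which the running mean is within the tolerance
  ok <- which(abs(cummean - target) <= tol)
  if (length(ok) == 0L) {
    stop("no prefix of the sorted data has a mean within tol of target")
  }
  # longest prefix whose mean is within the tolerance; this is the pool
  k <- max(ok)
  if (k < n) {
    stop("only ", k, " rows qualify, but n = ", n, " were requested")
  }
  # draw n distinct positions from the pool and map them back to row indices
  df[ord[sample.int(k, n)], , drop = FALSE]
}

# Usage on simulated data shaped like the question (assumed, not real data)
set.seed(42)
df <- data.frame(id = seq_len(5e5), X = rnorm(5e5, mean = 4.5))
s <- sample_near_mean(df, "X", target = 3.5, n = 5e4)
nrow(s)      # number of sampled rows
mean(s$X)    # should be close to 3.5 (within roughly the tolerance)
sd(s$X)      # much smaller than sd(df$X), see the note below
```

Because the pool consists only of the values closest to the target, the sampled X has a much smaller spread than the full data, which is the dispersion caveat mentioned at the start of the answer.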