I have a dataset with many rows (~500000). A column "X" of this dataset has a mean value of 4.5. I would like to sample the dataset (without replacement) to get approximately 50000 rows and, at the same time, a mean value of "X" of approximately 3.5.
How would I do that in R in a way that is reasonably fast?
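Since the OP's only criterion is that the sample mean be close to 3.5, with no constraint on dispersion, here is a possible approach: sort the rows by their absolute distance from the target, compute the running mean of X over that ordering, and sample from the rows whose running mean stays within a tolerance of the target.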
Code:
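library(data.table)
nr <- 5e5       # rows in the full dataset
ns <- 5e4       # rows wanted in the sample
DT <- data.table(X = rnorm(nr, 4.5))   # simulated data with mean 4.5
target <- 3.5   # desired sample mean
dev <- 0.05     # tolerated deviation from the target
# sort rows by their distance from the target, closest first
setorder(DT[, absDev := abs(X - target)], absDev)
# running mean of X over the sorted rows
DT[, cummean := cumsum(X) / seq_len(.N)]
# keep rows whose running mean is within the tolerance, then sample without replacement
x <- DT[(target - dev) <= cummean & cummean <= (target + dev), sample(X, ns)]
mean(x)
# no seed is set, so the exact value varies between runs
#[1] 3.549371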
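One caveat with the code above: sample(X, ns) fails if fewer than ns rows pass the cummean filter, which can happen when dev is small. Below is a minimal sketch of the same idea wrapped in a function with a guard for that case. The name sample_near_mean, its arguments, and the seed are my own assumptions for illustration and are not part of the answer above.

```r
library(data.table)

# Hypothetical helper: same sort / running-mean / sample idea, plus a size check.
sample_near_mean <- function(X, ns, target, dev) {
  DT <- data.table(X = X)
  DT[, absDev := abs(X - target)]
  setorder(DT, absDev)                        # closest to the target first
  DT[, cummean := cumsum(X) / seq_len(.N)]    # running mean over the sorted rows
  # values whose running mean is still within the tolerance
  pool <- DT[(target - dev) <= cummean & cummean <= (target + dev), X]
  if (length(pool) < ns) {
    stop("only ", length(pool), " rows qualify; increase dev or reduce ns")
  }
  # index-based sampling without replacement
  pool[sample.int(length(pool), ns)]
}

set.seed(42)  # arbitrary seed, only so that reruns give the same sample
X <- rnorm(5e5, 4.5)
x <- sample_near_mean(X, ns = 5e4, target = 3.5, dev = 0.05)
length(x)
mean(x)
```

With data like this, the qualifying rows are essentially the longest sorted prefix whose running mean has not yet drifted past target + dev, so the sample mean tends to sit near the upper edge of the tolerance rather than exactly at the target. A smaller dev pulls it closer to 3.5 but shrinks the pool, which is where the guard becomes useful.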