Sample a specific number of tibble rows conditional on a variable/column reaching a certain mean value

I have a dataset with many rows (~500000). Column "X" of this dataset has a mean value of 4.5. I would like to sample the dataset (without replacement) to get approximately 50000 rows while bringing the mean of "X" to approximately 3.5.

How would I do that in R in a way that is reasonably fast?

Solution

Since the OP's only criterion is for the sample mean to be close to 3.5, without regard to dispersion, here is a possible approach:

calculate the absolute deviation of X from 3.5,
sort the data by this absolute deviation,
calculate the cumulative mean of X in that sorted order,
keep only the rows whose cumulative mean is around 3.5 (within a chosen tolerance), then sample from those rows.
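
As a toy illustration of the sorting and cumulative-mean steps, here is a minimal base R sketch on a made-up six-element vector (the values are invented for illustration and are not from the OP's data). It only shows how the cumulative mean behaves once the values are ordered by their distance from the target:

```r
# Made-up values, chosen only to illustrate the mechanics
X <- c(5, 3.2, 4, 6, 3.5, 2.5)
target <- 3.5

# Order the values by absolute distance from the target (closest first)
ord <- order(abs(X - target))
Xs <- X[ord]

# Running mean of the reordered values
cm <- cumsum(Xs) / seq_along(Xs)

Xs             # 3.5 3.2 4.0 2.5 5.0 6.0
round(cm, 3)   # 3.500 3.350 3.567 3.300 3.640 4.033
```

The cumulative mean starts at the value nearest the target and can never be further from the target than the largest deviation included so far, which is why it stays close to 3.5 for the earliest rows.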

Code:

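library(data.table)
nr <- 5e5   # number of rows in the full data
ns <- 5e4   # number of rows to sample
DT <- data.table(X=rnorm(nr, 4.5))   # simulated data with mean(X) close to 4.5

target <- 3.5   # desired sample mean
dev <- 0.05     # tolerance around the target for the cumulative mean
# sort rows by absolute distance of X from the target (closest first)
DT[, absDev := abs(X - target)]
setorder(DT, absDev)
# running mean of X in that order
DT[, cummean := cumsum(X) / seq_len(.N)]
# keep rows whose running mean is within the tolerance, then sample ns of them
# (sample() errors if fewer than ns rows qualify)
x <- DT[(target-dev) <= cummean & cummean <= (target+dev), sample(X, ns)]
mean(x)
#[1] 3.549371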
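
One caveat is that `sample(X, ns)` stops with an error when fewer than `ns` rows fall inside the tolerance band. The sketch below wraps the same steps in a function that checks this first and returns whole rows rather than just `X`. The name `sample_near_mean`, its arguments, and the seed are hypothetical choices for illustration, not part of the answer above:

```r
library(data.table)

# Hypothetical helper (illustrative only): same steps as above, plus an
# explicit check that enough rows qualify before sampling.
sample_near_mean <- function(DT, target, dev, ns) {
  D <- copy(DT)                      # work on a copy; leave the input untouched
  D[, absDev := abs(X - target)]     # distance of each X from the target
  setorder(D, absDev)                # closest rows first
  D[, cummean := cumsum(X) / seq_len(.N)]
  keep <- D[(target - dev) <= cummean & cummean <= (target + dev)]
  if (nrow(keep) < ns) {
    stop("only ", nrow(keep), " rows qualify; increase dev or lower ns")
  }
  keep[sample(.N, ns)]               # ns rows drawn without replacement
}

set.seed(1)  # arbitrary seed, used only to make this sketch repeatable
DT <- data.table(X = rnorm(5e5, 4.5))
s <- sample_near_mean(DT, target = 3.5, dev = 0.05, ns = 5e4)
nrow(s)
mean(s$X)
```

With the simulated data used here, the qualifying rows form a block of the closest values whose overall mean sits near the upper edge of the band, so the sampled mean tends to land close to `target + dev` rather than exactly on `target`; a smaller `dev` tightens this at the cost of fewer eligible rows.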